Preface


This book is aimed at the reader who wishes to gain a working knowledge of time 
series and forecasting methods as applied in economics, engineering and the natural 
and social sciences. Unlike our earlier book, Time Series: Theory and Methods, referred
to in the text as TSTM, this one requires only a knowledge of basic calculus,
matrix algebra and elementary statistics at the level (for example) of Mendenhall, 
Wackerly and Scheaffer (1990). It is intended for upper-level undergraduate students 
and beginning graduate students. 

The emphasis is on methods and the analysis of data sets. The student version 
of the time series package ITSM2000, enabling the reader to reproduce most of the 
calculations in the text (and to analyze further data sets of the reader’s own choosing), 
is included on the CD-ROM which accompanies the book. The data sets used in the 
book are also included. The package requires an IBM-compatible PC operating under 
Windows 95, NT version 4.0, or a later version of either of these operating systems. 
The program ITSM can be run directly from the CD-ROM or installed on a hard disk 
as described at the beginning of Appendix D, where a detailed introduction to the 
package is provided. 

Very little prior familiarity with computing is required in order to use the computer 
package. Detailed instructions for its use are found in the on-line help files which 
are accessed, when the program ITSM is running, by selecting the menu option 
Help>Contents and selecting the topic of interest. Under the heading Data you 
will find information concerning the data sets stored on the CD-ROM. The book can 
also be used in conjunction with other computer packages for handling time series. 
Chapter 14 of the book by Venables and Ripley (1994) describes how to perform 
many of the calculations using S-plus. 

There are numerous problems at the end of each chapter, many of which involve 
use of the programs to study the data sets provided. 

To make the underlying theory accessible to a wider audience, we have stated 
some of the key mathematical results without proof, but have attempted to ensure 
that the logical structure of the development is otherwise complete. (References to 
proofs are provided for the interested reader.) 
Preface


Since the upgrade to ITSM2000 occurred after the first edition of this book 
appeared, we have taken the opportunity, in this edition, to coordinate the text with 
the new software, to make a number of corrections pointed out by readers of the first 
edition and to expand on several of the topics treated only briefly in the first edition. 

Appendix D, the software tutorial, has been rewritten in order to be compatible 
with the new version of the software. 

Some of the other extensive changes occur in (i) Section 6.6, which highlights 
the role of the innovations algorithm in generalized least squares and maximum 
likelihood estimation of regression models with time series errors, (ii) Section 6.4, 
where the treatment of forecast functions for ARIMA processes has been expanded 
and (iii) Section 10.3, which now includes GARCH modeling and simulation, topics 
of considerable importance in the analysis of financial time series. The new material 
has been incorporated into the accompanying software, to which we have also added 
the option Autofit. This streamlines the modeling of time series data by fitting 
maximum likelihood ARMA (p, q) models for a specified range of (p, q) values and 
automatically selecting the model with smallest AICC value. 

There is sufficient material here for a full-year introduction to univariate and multivariate
time series and forecasting. Chapters 1 through 6 have been used for several
years in introductory one-semester courses in univariate time series at Colorado State 
University and Royal Melbourne Institute of Technology. The chapter on spectral 
analysis can be excluded without loss of continuity by readers who are so inclined. 

We are greatly indebted to the readers of the first edition and especially to Matthew
Calder, coauthor of the new computer package, and Anthony Brockwell for their 
many valuable comments and suggestions. We also wish to thank Colorado State 
University, the National Science Foundation, Springer-Verlag and our families for 
their continuing support during the preparation of this second edition. 


Fort Collins, Colorado Peter J. Brockwell 
August 2001 Richard A. Davis 
Introduction


1.1 Examples of Time Series 

1.2 Objectives of Time Series Analysis 

1.3 Some Simple Time Series Models

1.4 Stationary Models and the Autocorrelation Function 

1.5 Estimation and Elimination of Trend and Seasonal Components 
1.6 Testing the Estimated Noise Sequence 


In this chapter we introduce some basic ideas of time series analysis and stochastic
processes. Of particular importance are the concepts of stationarity and the autocovariance
and sample autocovariance functions. Some standard techniques are described
for the estimation and removal of trend and seasonality (of known period) from an
observed time series. These are illustrated with reference to the data sets in Section
1.1. The calculations in all the examples can be carried out using the time series package
ITSM, the student version of which is supplied on the enclosed CD. The data sets
are contained in files with names ending in .TSM. For example, the Australian red
wine sales are filed as WINE.TSM. Most of the topics covered in this chapter will
be developed more fully in later sections of the book. The reader who is not already
familiar with random variables and random vectors should first read Appendix A,
where a concise account of the required background is given.


1.1 Examples of Time Series 


A time series is a set of observations $x_t$, each one being recorded at a specific time $t$.
A discrete-time time series (the type to which this book is primarily devoted) is one
in which the set $T_0$ of times at which observations are made is a discrete set, as is the
case, for example, when observations are made at fixed time intervals. Continuous-time
time series are obtained when observations are recorded continuously over some
time interval, e.g., when $T_0 = [0, 1]$.


Example 1.1.1 Australian red wine sales; WINE.TSM


Figure 1.1 shows the monthly sales (in kiloliters) of red wine by Australian winemakers
from January 1980 through October 1991. In this case the set $T_0$ consists of the
142 times {(Jan. 1980), (Feb. 1980), ..., (Oct. 1991)}. Given a set of $n$ observations
made at uniformly spaced time intervals, it is often convenient to rescale the time
axis in such a way that $T_0$ becomes the set of integers $\{1, 2, \ldots, n\}$. In the present
example this amounts to measuring time in months with (Jan. 1980) as month 1. Then
$T_0$ is the set $\{1, 2, \ldots, 142\}$. It appears from the graph that the sales have an upward
trend and a seasonal pattern with a peak in July and a trough in January. To plot the
data using ITSM, run the program by double-clicking on the ITSM icon and then
select the option File>Project>Open>Univariate, click OK, and select the file
WINE.TSM. The graph of the data will then appear on your screen.

Figure 1-1 The Australian red wine sales, Jan. ‘80 - Oct. ‘91.


Example 1.1.2 All-star baseball games, 1933-1995

Figure 1.2 shows the results of the all-star games by plotting $x_t$, where

$$x_t = \begin{cases} 1 & \text{if the National League won in year } t, \\ -1 & \text{if the American League won in year } t. \end{cases}$$
Figure 1-2 Results of the all-star baseball games, 1933-1995.

This is a series with only two possible values, $\pm 1$. It also has some missing values,
since no game was played in 1945, and two games were scheduled for each of the
years 1959-1962.


Example 1.1.3 Accidental deaths, U.S.A., 1973-1978; DEATHS.TSM


Like the red wine sales, the monthly accidental death figures show a strong seasonal
pattern, with the maximum for each year occurring in July and the minimum for each
year occurring in February. The presence of a trend in Figure 1.3 is much less apparent
than in the wine sales. In Section 1.5 we shall consider the problem of representing
the data as the sum of a trend, a seasonal component, and a residual term.


Example 1.1.4 A signal detection problem; SIGNAL.TSM

Figure 1.4 shows simulated values of the series

$$X_t = \cos\left(\frac{t}{10}\right) + N_t, \quad t = 1, 2, \ldots, 200,$$

where $\{N_t\}$ is a sequence of independent normal random variables, with mean 0
and variance 0.25. Such a series is often referred to as signal plus noise, the signal
being the smooth function $S_t = \cos(t/10)$ in this case. Given only the data $X_t$, how
can we determine the unknown signal component? There are many approaches to
this general problem under varying assumptions about the signal and the noise. One
simple approach is to smooth the data by expressing $X_t$ as a sum of sine waves of
various frequencies (see Section 4.2) and eliminating the high-frequency components.
If we do this to the values of $\{X_t\}$ shown in Figure 1.4 and retain only the lowest
3.5% of the frequency components, we obtain the estimate of the signal also shown
in Figure 1.4. The waveform of the signal is quite close to that of the true signal in
this case, although its amplitude is somewhat smaller.
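
The smoothing idea of Example 1.1.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ITSM implementation: the helper names, the random seed, and the reading of "the lowest 3.5% of the frequency components" as Fourier indices 0 through int(0.035 n) are all choices made here.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for n = 200)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(c):
    """Inverse of dft(); returns complex values."""
    n = len(c)
    return [
        sum(c[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(x, keep_fraction):
    """Zero all Fourier components above the cutoff index, then invert."""
    n = len(x)
    cutoff = int(keep_fraction * n)  # assumed meaning of "lowest 3.5%"
    coeffs = dft(x)
    for k in range(n):
        # For real data, indices k and n - k carry the same frequency.
        if min(k, n - k) > cutoff:
            coeffs[k] = 0
    return [value.real for value in inverse_dft(coeffs)]


def mean_squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


random.seed(0)  # arbitrary seed, used only for reproducibility
n = 200
signal = [math.cos(t / 10) for t in range(1, n + 1)]
noise = [random.gauss(0.0, 0.5) for _ in range(n)]  # sd 0.5, variance 0.25
data = [s + e for s, e in zip(signal, noise)]
smoothed = low_pass(data, 0.035)

print("MSE of raw data vs. signal:     ", mean_squared_error(data, signal))
print("MSE of smoothed data vs. signal:", mean_squared_error(smoothed, signal))
```

Because the frequency of cos(t/10) is not an exact Fourier frequency for n = 200, some of the signal leaks into the discarded components, so the recovered waveform is close to the true signal but not identical to it.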


Example 1.1.5 Population of the U.S.A., 1790-1990; USPOP.TSM 


The population of the U.S.A., measured at ten-year intervals, is shown in Figure 1.5. 
The graph suggests the possibility of fitting a quadratic or exponential trend to the 
data. We shall explore this further in Section 1.3. 


Example 1.1.6 Number of strikes per year in the U.S.A., 1951-1980; STRIKES.TSM 


The annual numbers of strikes in the U.S.A. for the years 1951-1980 are shown in
Figure 1.6. They appear to fluctuate erratically about a slowly changing level. 


1.2 Objectives of Time Series Analysis 


The examples considered in Section 1.1 are an extremely small sample from the 
multitude of time series encountered in the fields of engineering, science, sociology, 
and economics. Our purpose in this book is to study techniques for drawing inferences 
from such series. Before we can do this, however, it is necessary to set up a hypothetical 
probability model to represent the data. After an appropriate family of models has 
been chosen, it is then possible to estimate parameters, check for goodness of fit to 
the data, and possibly to use the fitted model to enhance our understanding of the 
mechanism generating the series. Once a satisfactory model has been developed, it 
may be used in a variety of ways depending on the particular field of application. 
The model may be used simply to provide a compact description of the data. We 
may, for example, be able to represent the accidental deaths data of Example 1.1.3 as 
the sum of a specified trend, and seasonal and random terms. For the interpretation 
of economic statistics such as unemployment figures, it is important to recognize the 
presence of seasonal components and to remove them so as not to confuse them with 
long-term trends. This process is known as seasonal adjustment. Other applications 
of time series models include separation (or filtering) of noise from signals as in 
Example 1.1.4, prediction of future values of a series such as the red wine sales in 
Example 1.1.1 or the population data in Example 1.1.5, testing hypotheses such as 
global warming using recorded temperature data, predicting one series from observations
of another, e.g., predicting future sales using advertising expenditure data,
and controlling future values of a series by adjusting parameters. Time series models
are also useful in simulation studies. For example, the performance of a reservoir
depends heavily on the random daily inputs of water to the system. If these are modeled
as a time series, then we can use the fitted model to simulate a large number
of independent sequences of daily inputs. Knowing the size and mode of operation 
of the reservoir, we can determine the fraction of the simulated input sequences that
cause the reservoir to run out of water in a given time period. This fraction will then 
be an estimate of the probability of emptiness of the reservoir at some time in the 
given period. 
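
The reservoir calculation can be sketched as a small Monte Carlo experiment. Everything specific below is hypothetical: the iid exponential inflow model stands in for whatever fitted time series model is available, and the capacity, release rule, and parameter values are made up for illustration.

```python
import random


def runs_dry(inflows, capacity, initial_level, daily_release):
    """Return True if the reservoir level reaches zero during the period."""
    level = initial_level
    for inflow in inflows:
        # Water above capacity spills; a fixed amount is released each day.
        level = min(capacity, level + inflow) - daily_release
        if level <= 0:
            return True
    return False


def simulate_inflows(n_days, mean_inflow):
    """Placeholder input model (an assumption): iid exponential inflows."""
    return [random.expovariate(1.0 / mean_inflow) for _ in range(n_days)]


def emptiness_probability(n_sims, n_days, mean_inflow, **reservoir):
    """Fraction of simulated inflow sequences that empty the reservoir."""
    dry = sum(
        runs_dry(simulate_inflows(n_days, mean_inflow), **reservoir)
        for _ in range(n_sims)
    )
    return dry / n_sims


random.seed(1)  # arbitrary seed, used only for reproducibility
estimate = emptiness_probability(
    n_sims=1000, n_days=365, mean_inflow=1.0,
    capacity=50.0, initial_level=25.0, daily_release=1.0,
)
print("Estimated probability of emptiness:", estimate)
```

In practice `simulate_inflows` would be replaced by a simulator for the fitted model, so that dependence between successive daily inputs is reflected in the estimate.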


1.3 Some Simple Time Series Models


An important part of the analysis of a time series is the selection of a suitable probability
model (or class of models) for the data. To allow for the possibly unpredictable
nature of future observations it is natural to suppose that each observation $x_t$ is a
realized value of a certain random variable $X_t$.


Definition 1.3.1 A time series model for the observed data $\{x_t\}$ is a specification of the joint
distributions (or possibly only the means and covariances) of a sequence of random
variables $\{X_t\}$ of which $\{x_t\}$ is postulated to be a realization.


Remark. We shall frequently use the term time series to mean both the data and 
the process of which it is a realization. 


A complete probabilistic time series model for the sequence of random variables
$\{X_1, X_2, \ldots\}$ would specify all of the joint distributions of the random vectors
$(X_1, \ldots, X_n)$, $n = 1, 2, \ldots$, or equivalently all of the probabilities

$$P[X_1 \le x_1, \ldots, X_n \le x_n], \quad -\infty < x_1, \ldots, x_n < \infty, \quad n = 1, 2, \ldots.$$

Such a specification is rarely used in time series analysis (unless the data are generated
by some well-understood simple mechanism), since in general it will contain far too
many parameters to be estimated from the available data. Instead we specify only the
first- and second-order moments of the joint distributions, i.e., the expected values
$EX_t$ and the expected products $E(X_{t+h} X_t)$, $t = 1, 2, \ldots$, $h = 0, 1, 2, \ldots$, focusing
on properties of the sequence $\{X_t\}$ that depend only on these. Such properties of $\{X_t\}$
are referred to as second-order properties. In the particular case where all the joint
distributions are multivariate normal, the second-order properties of $\{X_t\}$ completely
determine the joint distributions and hence give a complete probabilistic characterization
of the sequence. In general we shall lose a certain amount of information by
looking at time series “through second-order spectacles”; however, as we shall see
in Chapter 2, the theory of minimum mean squared error linear prediction depends
only on the second-order properties, thus providing further justification for the use
of the second-order characterization of time series models.

Figure 1.7 shows one of many possible realizations of $\{S_t, t = 1, \ldots, 200\}$, where
$\{S_t\}$ is a sequence of random variables specified in Example 1.3.3 below. In most
practical problems involving time series we see only one realization. For example,
there is only one available realization of Fort Collins’s annual rainfall for the years
1900-1996, but we imagine it to be one of the many sequences that might have
occurred. In the following examples we introduce some simple time series models.
One of our goals will be to expand this repertoire so as to have at our disposal a broad
range of models with which to try to match the observed behavior of given data sets.


1.3.1 Some Zero-Mean Models


Example 1.3.1 iid noise

Perhaps the simplest model for a time series is one in which there is no trend or
seasonal component and in which the observations are simply independent and identically
distributed (iid) random variables with zero mean. We refer to such a sequence
of random variables $X_1, X_2, \ldots$ as iid noise. By definition we can write, for any
positive integer $n$ and real numbers $x_1, \ldots, x_n$,

$$P[X_1 \le x_1, \ldots, X_n \le x_n] = P[X_1 \le x_1] \cdots P[X_n \le x_n] = F(x_1) \cdots F(x_n),$$

where $F(\cdot)$ is the cumulative distribution function (see Section A.1) of each of
the identically distributed random variables $X_1, X_2, \ldots$. In this model there is no
dependence between observations. In particular, for all $h \ge 1$ and all $x, x_1, \ldots, x_n$,

$$P[X_{n+h} \le x \mid X_1 = x_1, \ldots, X_n = x_n] = P[X_{n+h} \le x],$$

showing that knowledge of $X_1, \ldots, X_n$ is of no value for predicting the behavior
of $X_{n+h}$. Given the values of $X_1, \ldots, X_n$, the function $f$ that minimizes the mean
squared error $E[(X_{n+h} - f(X_1, \ldots, X_n))^2]$ is in fact identically zero (see Problem
1.2). Although this means that iid noise is a rather uninteresting process for forecasters,
it plays an important role as a building block for more complicated time series
models.


Example 1.3.2 A binary process

As an example of iid noise, consider the sequence of iid random variables
$\{X_t, t = 1, 2, \ldots\}$ with

$$P[X_t = 1] = p, \quad P[X_t = -1] = 1 - p,$$

where $p = \frac{1}{2}$. The time series obtained by tossing a penny repeatedly and scoring
+1 for each head and −1 for each tail is usually modeled as a realization of this
process. A priori we might well consider the same process as a model for the all-star
baseball games in Example 1.1.2. However, even a cursory inspection of the results
from 1963-1982, which show the National League winning 19 of 20 games, casts
serious doubt on the hypothesis $P[X_t = 1] = \frac{1}{2}$.
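
To see how strongly the 1963-1982 record conflicts with the fair-coin model, one can compute the probability of at least 19 National League wins in 20 independent games with p = 1/2. The sketch below uses only the binomial distribution implied by the iid binary model; the function name is ours.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P[at least k successes in n independent trials, success prob. p]."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Probability of 19 or 20 National League wins in 20 games under p = 1/2.
tail = binomial_upper_tail(20, 19, 0.5)
print(tail)  # equals 21 / 2**20, roughly 2 in 100,000
```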
Example 1.3.3 Random walk

The random walk $\{S_t, t = 0, 1, 2, \ldots\}$ (starting at zero) is obtained by cumulatively
summing (or “integrating”) iid random variables. Thus a random walk with zero mean
is obtained by defining $S_0 = 0$ and

$$S_t = X_1 + X_2 + \cdots + X_t, \quad \text{for } t = 1, 2, \ldots,$$

where $\{X_t\}$ is iid noise. If $\{X_t\}$ is the binary process of Example 1.3.2, then
$\{S_t, t = 0, 1, 2, \ldots\}$ is called a simple symmetric random walk. This walk can be viewed
as the location of a pedestrian who starts at position zero at time zero and at each
integer time tosses a fair coin, stepping one unit to the right each time a head appears
and one unit to the left for each tail. A realization of length 200 of a simple symmetric
random walk is shown in Figure 1.7. Notice that the outcomes of the coin tosses can
be recovered from $\{S_t, t = 0, 1, \ldots\}$ by differencing. Thus the result of the $t$th toss
can be found from $S_t - S_{t-1} = X_t$.

Figure 1-7 One realization of a simple random walk $\{S_t, t = 0, 1, 2, \ldots, 200\}$.
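
The construction of the walk by cumulative summation, and the recovery of the coin tosses by differencing, can be illustrated as follows. The seed and variable names are arbitrary choices made for this sketch.

```python
import random
from itertools import accumulate

random.seed(42)  # arbitrary seed, used only for reproducibility

# X_1, ..., X_200: fair coin tosses scored +1 (head) or -1 (tail).
steps = [random.choice((1, -1)) for _ in range(200)]

# S_0 = 0 and S_t = X_1 + ... + X_t for t = 1, ..., 200.
walk = [0] + list(accumulate(steps))

# Differencing recovers the tosses: S_t - S_{t-1} = X_t.
recovered = [walk[t] - walk[t - 1] for t in range(1, len(walk))]

print(recovered == steps)
```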


1.3.2 Models with Trend and Seasonality 


In several of the time series examples of Section 1.1 there is a clear trend in the data. 
An increasing trend is apparent in both the Australian red wine sales (Figure 1.1) 
and the population of the U.S.A. (Figure 1.5). In both cases a zero-mean model for 
the data is clearly inappropriate. The graph of the population data, which contains no 
apparent periodic component, suggests trying a model of the form 


$$X_t = m_t + Y_t,$$

where $m_t$ is a slowly changing function known as the trend component and $Y_t$ has
zero mean. A useful technique for estimating $m_t$ is the method of least squares (some
other methods are considered in Section 1.5).
In the least squares procedure we attempt to fit a parametric family of functions,
e.g.,

$$m_t = a_0 + a_1 t + a_2 t^2, \tag{1.3.1}$$

to the data $\{x_1, \ldots, x_n\}$ by choosing the parameters, in this illustration $a_0$, $a_1$, and $a_2$, to
minimize $\sum_{t=1}^{n} (x_t - m_t)^2$. This method of curve fitting is called least squares regression
and can be carried out using the program ITSM and selecting the Regression
option.
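
For readers working outside ITSM, the least squares fit of a polynomial trend can be sketched by solving the normal equations directly. This is a minimal illustration, not the ITSM routine: the function names are ours and the data are synthetic values lying exactly on a known quadratic, so that the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial_trend(x, order):
    """Least squares coefficients a_0, ..., a_order for x_1, ..., x_n at t = 1..n."""
    n = len(x)
    powers = [[t ** j for j in range(order + 1)] for t in range(1, n + 1)]
    # Normal equations (T'T) a = T'x, where row t of T is (1, t, ..., t^order).
    ttt = [
        [sum(row[i] * row[j] for row in powers) for j in range(order + 1)]
        for i in range(order + 1)
    ]
    ttx = [sum(row[i] * obs for row, obs in zip(powers, x)) for i in range(order + 1)]
    return solve_linear_system(ttt, ttx)


# Synthetic data (hypothetical): 21 points on the quadratic 5 - 2t + 0.5t^2.
data = [5.0 - 2.0 * t + 0.5 * t * t for t in range(1, 22)]
a0, a1, a2 = fit_polynomial_trend(data, 2)
print(a0, a1, a2)
```

Solving the normal equations is adequate for low-order polynomials on short series; for higher orders a numerically more stable method such as a QR decomposition would be preferable.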


Example 1.3.4 Population of the U.S.A., 1790-1990


To fit a function of the form (1.3.1) to the population data shown in Figure 1.5 we
relabel the time axis so that $t = 1$ corresponds to 1790 and $t = 21$ corresponds to
1990. Run ITSM, select File>Project>Open>Univariate, and open the file
USPOP.TSM. Then select Regression>Specify, choose Polynomial Regression
with order equal to 2, and click OK. Then select Regression>Estimation>Least
Squares, and you will obtain the following estimated parameter values in the model
(1.3.1):

$$\hat{a}_0 = 6.9579 \times 10^6,$$

$$\hat{a}_1 = -2.1599 \times 10^6,$$

and

$$\hat{a}_2 = 6.5063 \times 10^5.$$

A graph of the fitted function is shown with the original data in Figure 1.8. The
estimated values of the noise process $Y_t$, $1 \le t \le 21$, are the residuals obtained by
subtraction of $\hat{m}_t = \hat{a}_0 + \hat{a}_1 t + \hat{a}_2 t^2$ from $x_t$.

The estimated trend component $\hat{m}_t$ furnishes us with a natural predictor of future
values of $X_t$. For example, if we estimate the noise $Y_{22}$ by its mean value, i.e., zero,
then (1.3.1) gives the estimated U.S. population for the year 2000 as

$$\hat{m}_{22} = 6.9579 \times 10^6 - 2.1599 \times 10^6 \times 22 + 6.5063 \times 10^5 \times 22^2 = 274.35 \times 10^6.$$

However, if the residuals $\{Y_t\}$ are highly correlated, we may be able to use their values
to give a better estimate of $Y_{22}$ and hence of the population $X_{22}$ in the year 2000.
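
The arithmetic behind the year-2000 prediction can be checked directly. The powers of ten used below (10^6, 10^6, and 10^5) are an assumption on our part: they are the exponents for which the three quoted coefficients reproduce the quoted prediction of about 274.35 million.

```python
# Fitted coefficients of the quadratic trend; the exponents are assumed
# (see the note above), the mantissas are those quoted in the text.
a0, a1, a2 = 6.9579e6, -2.1599e6, 6.5063e5


def quadratic_trend(t):
    """Evaluate the trend a0 + a1*t + a2*t^2 at time index t."""
    return a0 + a1 * t + a2 * t ** 2


# t = 1 is 1790 and t = 21 is 1990, so the year 2000 corresponds to t = 22.
m22 = quadratic_trend(22)
print(f"{m22 / 1e6:.2f} million")
```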


Example 1.3.5 Level of Lake Huron 1875-1972; LAKE.DAT


A graph of the level in feet of Lake Huron (reduced by 570) in the years 1875-1972
is displayed in Figure 1.9. Since the lake level appears to decline at a roughly linear
rate, ITSM was used to fit a model of the form

$$X_t = a_0 + a_1 t + Y_t, \quad t = 1, \ldots, 98 \tag{1.3.2}$$

(with the time axis relabeled as in Example 1.3.4). The least squares estimates of the
parameter values are

$$\hat{a}_0 = 10.202 \quad \text{and} \quad \hat{a}_1 = -0.0242.$$

(The resulting least squares line, $\hat{a}_0 + \hat{a}_1 t$, is also displayed in Figure 1.9.) The
estimates of the noise, $Y_t$, in the model (1.3.2) are the residuals obtained by subtracting
the least squares line from $x_t$ and are plotted in Figure 1.10. There are two interesting
features of the graph of the residuals. The first is the absence of any discernible trend.
The second is the smoothness of the graph. (In particular, there are long stretches of
residuals that have the same sign. This would be very unlikely to occur if the residuals
were observations of iid noise with zero mean.) Smoothness of the graph of a time
series is generally indicative of the existence of some form of dependence among the
observations.

Such dependence can be used to advantage in forecasting future values of the
series. If we were to assume the validity of the fitted model with iid residuals $\{Y_t\}$, then
the minimum mean squared error predictor of the next residual ($Y_{99}$) would be zero
(by Problem 1.2). However, Figure 1.10 strongly suggests that $Y_{99}$ will be positive.

Figure 1-8 Population of the U.S.A. showing the quadratic trend fitted by least squares.

Figure 1-9 Level of Lake Huron 1875-1972 showing the line fitted by least squares.

How then do we quantify dependence, and how do we construct models for forecasting
that incorporate dependence of a particular type? To deal with these questions,
Section 1.4 introduces the autocorrelation function as a measure of dependence, and
stationary processes as a family of useful models exhibiting a wide variety of dependence
structures.


Harmonic Regression 

Many time series are influenced by seasonally varying factors such as the weather, the 
effect of which can be modeled by a periodic component with fixed known period. For 
example, the accidental deaths series (Figure 1.3) shows a repeating annual pattern 
with peaks in July and troughs in February, strongly suggesting a seasonal factor 
with period 12. In order to represent such a seasonal effect, allowing for noise but 
assuming no trend, we can use the simple model, 


$$X_t = s_t + Y_t,$$

where $s_t$ is a periodic function of $t$ with period $d$ ($s_{t-d} = s_t$). A convenient choice for
$s_t$ is a sum of harmonics (or sine waves) given by

$$s_t = a_0 + \sum_{j=1}^{k} \left( a_j \cos(\lambda_j t) + b_j \sin(\lambda_j t) \right), \tag{1.3.3}$$

where $a_0, a_1, \ldots, a_k$ and $b_1, \ldots, b_k$ are unknown parameters and $\lambda_1, \ldots, \lambda_k$ are fixed
frequencies, each being some integer multiple of $2\pi/d$. To carry out harmonic regression
using ITSM, select Regression>Specify and check Include intercept
term and Harmonic Regression. Then specify the number of harmonics ($k$ in
(1.3.3)) and enter $k$ integer-valued Fourier indices $f_1, \ldots, f_k$. For a sine wave with
period $d$, set $f_1 = n/d$, where $n$ is the number of observations in the time series. (If
$n/d$ is not an integer, you will need to delete a few observations from the beginning
of the series to make it so.) The other $k - 1$ Fourier indices should be positive integer
multiples of the first, corresponding to harmonics of the fundamental sine wave with
period $d$. Thus to fit a single sine wave with period 365 to 365 daily observations we
would choose $k = 1$ and $f_1 = 1$. To fit a linear combination of sine waves with periods
$365/j$, $j = 1, \ldots, 4$, we would choose $k = 4$ and $f_j = j$, $j = 1, \ldots, 4$. Once $k$ and
$f_1, \ldots, f_k$ have been specified, click OK and then select Regression>Estimation>Least
Squares to obtain the required regression coefficients. To see how well the
fitted function matches the data, select Regression>Show fit.

Figure 1-11 The estimated harmonic component of the accidental deaths data from ITSM.


Example 1.3.6 Accidental deaths


To fit a sum of two harmonics with periods twelve months and six months to the
monthly accidental deaths data $x_1, \ldots, x_n$ with $n = 72$, we choose $k = 2$,
$f_1 = n/12 = 6$, and $f_2 = n/6 = 12$. Using ITSM as described above, we obtain the fitted
function shown in Figure 1.11. As can be seen from the figure, the periodic character
of the series is captured reasonably well by this fitted function. In practice, it is worth
experimenting with several different combinations of harmonics in order to find a satisfactory
estimate of the seasonal component. The program ITSM also allows fitting
a linear combination of harmonics and polynomial trend by checking both Harmonic
Regression and Polynomial Regression in the Regression>Specification
dialog box. Other methods for dealing with seasonal variation in the presence of
trend are described in Section 1.5.
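
Harmonic regression with integer Fourier indices can also be sketched without ITSM. The code below relies on a standard fact that is not stated in the text: when the frequencies are 2πf/n for distinct integers f with 0 < f < n/2, the sine and cosine regressors are orthogonal over t = 1, ..., n, so the least squares coefficients have a simple closed form. The function names and the synthetic monthly series are hypothetical.

```python
import math


def harmonic_fit(x, fourier_indices):
    """Least squares fit of a_0 + sum_j (a_j cos(l_j t) + b_j sin(l_j t)).

    Assumes distinct integer Fourier indices f with 0 < f < n/2, so that
    l_j = 2*pi*f_j/n and the regressors are orthogonal over t = 1..n.
    """
    n = len(x)
    a0 = sum(x) / n
    terms = []
    for f in fourier_indices:
        lam = 2 * math.pi * f / n
        a = 2 / n * sum(x[t - 1] * math.cos(lam * t) for t in range(1, n + 1))
        b = 2 / n * sum(x[t - 1] * math.sin(lam * t) for t in range(1, n + 1))
        terms.append((lam, a, b))
    return a0, terms


def harmonic_value(t, a0, terms):
    """Evaluate the fitted seasonal function at time t."""
    return a0 + sum(a * math.cos(lam * t) + b * math.sin(lam * t)
                    for lam, a, b in terms)


# Synthetic monthly data (hypothetical), n = 72, with periods 12 and 6.
n = 72
synthetic = [
    9000
    + 700 * math.cos(2 * math.pi * t / 12)
    - 300 * math.sin(2 * math.pi * t / 12)
    + 150 * math.cos(2 * math.pi * t / 6)
    for t in range(1, n + 1)
]

# k = 2 harmonics with Fourier indices f_1 = n/12 = 6 and f_2 = n/6 = 12.
a0, terms = harmonic_fit(synthetic, [6, 12])
print(a0, [(round(a, 6), round(b, 6)) for _, a, b in terms])
```

If a polynomial trend is fitted together with the harmonics, the regressors are no longer orthogonal and a general least squares solver is needed instead of the closed form used here.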


1.3.3 A General Approach to Time Series Modeling 


The examples of the previous section illustrate a general approach to time series 
analysis that will form the basis for much of what is done in this book. Before 
introducing the ideas of dependence and stationarity, we outline this approach to 
provide the reader with an overview of the way in which the various ideas of this 
chapter fit together. 


• Plot the series and examine the main features of the graph, checking in particular
whether there is 
(a) a trend, 
(b) a seasonal component, 
(c) any apparent sharp changes in behavior, 
(d) any outlying observations. 


• Remove the trend and seasonal components to get stationary residuals (as defined
in Section 1.4). To achieve this goal it may sometimes be necessary to apply a
preliminary transformation to the data. For example, if the magnitude of the
fluctuations appears to grow roughly linearly with the level of the series, then
the transformed series $\{\ln X_1, \ldots, \ln X_n\}$ will have fluctuations of more constant
magnitude. See, for example, Figures 1.1 and 1.17. (If some of the data are
negative, add a positive constant to each of the data values to ensure that all
values are positive before taking logarithms.) There are several ways in which
trend and seasonality can be removed (see Section 1.5), some involving estimating
the components and subtracting them from the data, and others depending on
differencing the data, i.e., replacing the original series $\{X_t\}$ by $\{Y_t := X_t - X_{t-d}\}$
for some positive integer $d$. Whichever method is used, the aim is to produce a
stationary series, whose values we shall refer to as residuals.


• Choose a model to fit the residuals, making use of various sample statistics including
the sample autocorrelation function to be defined in Section 1.4.


• Forecasting will be achieved by forecasting the residuals and then inverting the
transformations described above to arrive at forecasts of the original series $\{X_t\}$.


• An extremely useful alternative approach touched on only briefly in this book is
to express the series in terms of its Fourier components, which are sinusoidal 
waves of different frequencies (cf. Example 1.1.4). This approach is especially 
important in engineering applications such as signal processing and structural 
design. It is important, for example, to ensure that the resonant frequency of a 
structure does not coincide with a frequency at which the loading forces on the 
structure have a particularly large component. 
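
The transformation steps in the outline above (taking logarithms, differencing at lag d, and later inverting both operations) can be sketched as follows. The series is synthetic and the helper names are ours; it has exponential growth and a multiplicative period-12 pattern, so the lag-12 differences of the logged series are constant.

```python
import math


def difference(x, d):
    """Lag-d differences Y_t = X_t - X_{t-d}; the first d values are lost."""
    return [x[t] - x[t - d] for t in range(d, len(x))]


def undifference(diffs, first_values):
    """Invert difference(): rebuild the series from its first d values."""
    d = len(first_values)
    x = list(first_values)
    for y in diffs:
        x.append(y + x[-d])  # X_t = Y_t + X_{t-d}
    return x


d = 12
# Synthetic positive series (hypothetical): growth times a seasonal factor.
series = [
    math.exp(0.01 * t) * (10 + 3 * math.cos(2 * math.pi * t / d))
    for t in range(1, 61)
]

logged = [math.log(v) for v in series]          # stabilize the fluctuations
residuals = difference(logged, d)               # remove trend and seasonality
rebuilt = [math.exp(v) for v in undifference(residuals, logged[:d])]

print(residuals[:3])
print(max(abs(u - v) for u, v in zip(series, rebuilt)))
```

In a forecasting setting the same inversion would be applied to forecasts of the residuals rather than to the residuals themselves.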


1.4 Stationary Models and the Autocorrelation Function 


Loosely speaking, a time series $\{X_t, t = 0, \pm 1, \ldots\}$ is said to be stationary if it has statistical
properties similar to those of the “time-shifted” series $\{X_{t+h}, t = 0, \pm 1, \ldots\}$,
for each integer $h$. Restricting attention to those properties that depend only on the
first- and second-order moments of $\{X_t\}$, we can make this idea precise with the
following definitions.


Definition 1.4.1 Let $\{X_t\}$ be a time series with $E(X_t^2) < \infty$. The mean function of $\{X_t\}$ is

$$\mu_X(t) = E(X_t).$$

The covariance function of $\{X_t\}$ is

$$\gamma_X(r, s) = \mathrm{Cov}(X_r, X_s) = E[(X_r - \mu_X(r))(X_s - \mu_X(s))]$$

for all integers $r$ and $s$.


Definition 1.4.2 $\{X_t\}$ is (weakly) stationary if
(i) $\mu_X(t)$ is independent of $t$,
and
(ii) $\gamma_X(t + h, t)$ is independent of $t$ for each $h$.


Remark 1. Strict stationarity of a time series $\{X_t, t = 0, \pm 1, \ldots\}$ is defined by the
condition that $(X_1, \ldots, X_n)$ and $(X_{1+h}, \ldots, X_{n+h})$ have the same joint distributions
for all integers $h$ and $n > 0$. It is easy to check that if $\{X_t\}$ is strictly stationary and
$EX_t^2 < \infty$ for all $t$, then $\{X_t\}$ is also weakly stationary (Problem 1.3). Whenever we
use the term stationary we shall mean weakly stationary as in Definition 1.4.2, unless
we specifically indicate otherwise.


Remark 2. In view of condition (ii), whenever we use the term covariance function
with reference to a stationary time series $\{X_t\}$ we shall mean the function $\gamma_X$ of one
variable, defined by

$$\gamma_X(h) := \gamma_X(h, 0) = \gamma_X(t + h, t).$$

The function $\gamma_X(\cdot)$ will be referred to as the autocovariance function and $\gamma_X(h)$ as its
value at lag $h$.


Definition 1.4.3 Let $\{X_t\}$ be a stationary time series. The autocovariance function (ACVF) of
$\{X_t\}$ at lag $h$ is

$$\gamma_X(h) = \mathrm{Cov}(X_{t+h}, X_t).$$

The autocorrelation function (ACF) of $\{X_t\}$ at lag $h$ is

$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \mathrm{Cor}(X_{t+h}, X_t).$$


In the following examples we shall frequently use the easily verified linearity property
of covariances, that if $EX^2 < \infty$, $EY^2 < \infty$, $EZ^2 < \infty$ and $a$, $b$, and $c$ are any
real constants, then

$$\mathrm{Cov}(aX + bY + c, Z) = a\,\mathrm{Cov}(X, Z) + b\,\mathrm{Cov}(Y, Z).$$


Example 1.4.1 iid noise

If $\{X_t\}$ is iid noise and $E(X_t^2) = \sigma^2 < \infty$, then the first requirement of Definition
1.4.2 is obviously satisfied, since $E(X_t) = 0$ for all $t$. By the assumed independence,

$$\gamma_X(t + h, t) = \begin{cases} \sigma^2, & \text{if } h = 0, \\ 0, & \text{if } h \ne 0, \end{cases}$$

which does not depend on $t$. Hence iid noise with finite second moment is stationary.
We shall use the notation

$$\{X_t\} \sim \mathrm{IID}(0, \sigma^2)$$

to indicate that the random variables $X_t$ are independent and identically distributed
random variables, each with mean 0 and variance $\sigma^2$.


Example 1.4.2 White noise


If $\{X_t\}$ is a sequence of uncorrelated random variables, each with zero mean and
variance $\sigma^2$, then clearly $\{X_t\}$ is stationary with the same covariance function as the
iid noise in Example 1.4.1. Such a sequence is referred to as white noise (with mean
0 and variance $\sigma^2$). This is indicated by the notation

$$\{X_t\} \sim \mathrm{WN}(0, \sigma^2).$$
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Example 1.4.3 


Example 1.4.4 


Example 1.4.5 


Clearly, every IID (0, o?) sequence is WN (0, o°?) but not conversely (see Problem 1.8 
and the ARCH(1) process of Section 10.3). 


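Example 1.4.3 The random walk

If $\{S_t\}$ is the random walk defined in Example 1.3.3 with $\{X_t\}$ as in Example 1.4.1,
then $ES_t = 0$, $E(S_t^2) = t\sigma^2 < \infty$ for all $t$, and, for $h \geq 0$,

$$\begin{aligned}
\gamma_S(t + h, t) &= \mathrm{Cov}(S_{t+h}, S_t) \\
&= \mathrm{Cov}(S_t + X_{t+1} + \cdots + X_{t+h}, S_t) \\
&= \mathrm{Cov}(S_t, S_t) \\
&= t\sigma^2.
\end{aligned}$$

Since $\gamma_S(t + h, t)$ depends on $t$, the series $\{S_t\}$ is not stationary.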


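Example 1.4.4 First-order moving average or MA(1) process

Consider the series defined by the equation

$$X_t = Z_t + \theta Z_{t-1}, \quad t = 0, \pm 1, \ldots, \qquad (1.4.1)$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\theta$ is a real-valued constant. From (1.4.1) we see that
$EX_t = 0$, $EX_t^2 = \sigma^2(1 + \theta^2) < \infty$, and

$$\gamma_X(t + h, t) = \begin{cases} \sigma^2(1 + \theta^2), & \text{if } h = 0, \\ \sigma^2\theta, & \text{if } h = \pm 1, \\ 0, & \text{if } |h| > 1. \end{cases}$$

Thus the requirements of Definition 1.4.2 are satisfied, and $\{X_t\}$ is stationary. The
autocorrelation function of $\{X_t\}$ is

$$\rho_X(h) = \begin{cases} 1, & \text{if } h = 0, \\ \theta/(1 + \theta^2), & \text{if } h = \pm 1, \\ 0, & \text{if } |h| > 1. \end{cases}$$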
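As a worked check of the lag-one value, the linearity property of covariances can be applied directly to (1.4.1). The expansion below is a sketch of that step; it uses only the fact that the $Z_t$ are uncorrelated with common variance $\sigma^2$.

$$\begin{aligned}
\gamma_X(t + 1, t) &= \mathrm{Cov}(Z_{t+1} + \theta Z_t,\; Z_t + \theta Z_{t-1}) \\
&= \mathrm{Cov}(Z_{t+1}, Z_t) + \theta\,\mathrm{Cov}(Z_{t+1}, Z_{t-1}) + \theta\,\mathrm{Cov}(Z_t, Z_t) + \theta^2\,\mathrm{Cov}(Z_t, Z_{t-1}) \\
&= 0 + 0 + \theta\sigma^2 + 0 = \sigma^2\theta.
\end{aligned}$$

For $|h| > 1$ the two linear combinations share no common $Z$ term, so every covariance in the expansion vanishes.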


Example 1.4.5 First-order autoregression or AR(1) process

Let us assume now that $\{X_t\}$ is a stationary series satisfying the equations

$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots, \qquad (1.4.2)$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, $|\phi| < 1$, and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. (We
shall show in Section 2.2 that there is in fact exactly one such solution of (1.4.2).) By
taking expectations on each side of (1.4.2) and using the fact that $EZ_t = 0$, we see
at once that

$$EX_t = 0.$$

To find the autocorrelation function of $\{X_t\}$ we multiply each side of (1.4.2) by $X_{t-h}$
($h > 0$) and then take expectations to get

$$\begin{aligned}
\gamma_X(h) &= \mathrm{Cov}(X_t, X_{t-h}) \\
&= \mathrm{Cov}(\phi X_{t-1}, X_{t-h}) + \mathrm{Cov}(Z_t, X_{t-h}) \\
&= \phi\gamma_X(h - 1) + 0 = \cdots = \phi^h\gamma_X(0).
\end{aligned}$$

Observing that $\gamma(h) = \gamma(-h)$ and using Definition 1.4.3, we find that

$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \phi^{|h|}, \quad h = 0, \pm 1, \ldots.$$

It follows from the linearity of the covariance function in each of its arguments and
the fact that $Z_t$ is uncorrelated with $X_{t-1}$ that

$$\gamma_X(0) = \mathrm{Cov}(X_t, X_t) = \mathrm{Cov}(\phi X_{t-1} + Z_t, \phi X_{t-1} + Z_t) = \phi^2\gamma_X(0) + \sigma^2$$

and hence that $\gamma_X(0) = \sigma^2/(1 - \phi^2)$.
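The closed-form results of Examples 1.4.4 and 1.4.5 are easy to tabulate. The following is a minimal Python sketch, not part of ITSM; the function names `ma1_acf`, `ar1_acvf` and `ar1_acf` are illustrative choices made here.

```python
def ma1_acf(theta, h):
    """ACF of the MA(1) process X_t = Z_t + theta * Z_{t-1} at integer lag h."""
    h = abs(h)
    if h == 0:
        return 1.0
    if h == 1:
        return theta / (1.0 + theta ** 2)
    return 0.0


def ar1_acvf(phi, sigma2, h):
    """ACVF of the stationary AR(1) process X_t = phi * X_{t-1} + Z_t."""
    if abs(phi) >= 1:
        raise ValueError("the formula requires |phi| < 1")
    gamma0 = sigma2 / (1.0 - phi ** 2)   # gamma_X(0) = sigma^2 / (1 - phi^2)
    return gamma0 * phi ** abs(h)        # gamma_X(h) = phi^|h| * gamma_X(0)


def ar1_acf(phi, h):
    """ACF of the stationary AR(1) process; it does not depend on sigma^2."""
    return ar1_acvf(phi, 1.0, h) / ar1_acvf(phi, 1.0, 0)
```

For instance, `ar1_acf(0.7, h)` decays geometrically in `h`, which is the pattern looked for when an AR(1) model is considered for data.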


1.4.1 The Sample Autocorrelation Function 


Although we have just seen how to compute the autocorrelation function for a few
simple time series models, in practical problems we do not start with a model, but
with observed data $\{x_1, x_2, \ldots, x_n\}$. To assess the degree of dependence in the data
and to select a model for the data that reflects this, one of the important tools we
use is the sample autocorrelation function (sample ACF) of the data. If we believe
that the data are realized values of a stationary time series $\{X_t\}$, then the sample
ACF will provide us with an estimate of the ACF of $\{X_t\}$. This estimate may suggest
which of the many possible stationary time series models is a suitable candidate for
representing the dependence in the data. For example, a sample ACF that is close
to zero for all nonzero lags suggests that an appropriate model for the data might
be iid noise. The following definitions are natural sample analogues of those for the
autocovariance and autocorrelation functions given earlier for stationary time series
models.

Figure 1-12. 200 simulated values of iid N(0,1) noise.

Let $x_1, \ldots, x_n$ be observations of a time series. The sample mean of $x_1, \ldots, x_n$ is

$$\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t.$$

The sample autocovariance function is

$$\hat{\gamma}(h) := n^{-1}\sum_{t=1}^{n-|h|} (x_{t+|h|} - \bar{x})(x_t - \bar{x}), \quad -n < h < n.$$

The sample autocorrelation function is

$$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}, \quad -n < h < n.$$
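The three sample quantities just defined translate almost line for line into code. The sketch below uses plain Python lists; the function names are assumptions made for illustration and do not correspond to any ITSM routine.

```python
def sample_mean(x):
    """Sample mean of the observations x_1, ..., x_n."""
    return sum(x) / len(x)


def sample_acvf(x, h):
    """Sample autocovariance at lag h, using the divisor n (not n - |h|)."""
    n = len(x)
    h = abs(h)
    if h >= n:
        raise ValueError("lag must satisfy -n < h < n")
    xbar = sample_mean(x)
    # Sum over the n - |h| pairs (x_{t+|h|}, x_t); indices are 0-based here.
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n


def sample_acf(x, h):
    """Sample autocorrelation at lag h (undefined for a constant series)."""
    return sample_acvf(x, h) / sample_acvf(x, 0)
```

For the toy series `[1, 2, 3, 4, 5]` the sample autocovariance at lag 0 is 2 and at lag 1 is 0.8, so the sample autocorrelation at lag 1 is 0.4.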


Remark 3. For $h \geq 0$, $\hat{\gamma}(h)$ is approximately equal to the sample covariance of
the $n - h$ pairs of observations $(x_1, x_{1+h}), (x_2, x_{2+h}), \ldots, (x_{n-h}, x_n)$. The difference
arises from use of the divisor $n$ instead of $n - h$ and the subtraction of the overall
mean, $\bar{x}$, from each factor of the summands. Use of the divisor $n$ ensures that the
sample covariance matrix $\hat{\Gamma}_n := [\hat{\gamma}(i - j)]_{i,j=1}^{n}$ is nonnegative definite (see Section
2.4.2).

Remark 4. Like the sample covariance matrix defined in Remark 3, the sample
correlation matrix $\hat{R}_n := [\hat{\rho}(i - j)]_{i,j=1}^{n}$ is nonnegative definite. Each of its diagonal
elements is equal to 1, since $\hat{\rho}(0) = 1$.

Figure 1-13. The sample autocorrelation function for the data of Figure 1.12 showing the bounds $\pm 1.96/\sqrt{n}$.


Figure 1.12 shows 200 simulated values of normally distributed iid (0, 1), denoted
by IID N(0, 1), noise. Figure 1.13 shows the corresponding sample autocorrelation
function at lags 0, 1, ..., 40. Since $\rho(h) = 0$ for $h > 0$, one would also expect the
corresponding sample autocorrelations to be near 0. It can be shown, in fact, that for iid
noise with finite variance, the sample autocorrelations $\hat{\rho}(h)$, $h > 0$, are approximately
IID N(0, 1/n) for $n$ large (see TSTM p. 222). Hence, approximately 95% of the
sample autocorrelations should fall between the bounds $\pm 1.96/\sqrt{n}$ (since 1.96 is
the .975 quantile of the standard normal distribution). Therefore, in Figure 1.13 we
would expect roughly 40(.05) = 2 values to fall outside the bounds. To simulate 200
values of IID N(0, 1) noise using ITSM, select File>Project>New>Univariate
then Model>Simulate. In the resulting dialog box, enter 200 for the required Number
of Observations. (The remaining entries in the dialog box can be left as they are,
since the model assumed by ITSM, until you enter another, is IID N(0, 1) noise. If
you wish to reproduce exactly the same sequence at a later date, record the Random
Number Seed for later use. By specifying different values for the random number
seed you can generate independent realizations of your time series.) Click on OK and
you will see the graph of your simulated series. To see its sample autocorrelation
function together with the autocorrelation function of the model that generated it,
click on the third yellow button at the top of the screen and you will see the two
graphs superimposed (with the latter in red). The horizontal lines on the graph are
the bounds $\pm 1.96/\sqrt{n}$.
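A comparable experiment can be run outside ITSM. The sketch below simulates 200 values of Gaussian noise with Python's standard `random` module (not ITSM's generator, so the numbers will differ from Figure 1.12) and lists the lags up to 40 whose sample autocorrelation falls outside $\pm 1.96/\sqrt{n}$. The seed 12345 and the function name are arbitrary choices made for this illustration.

```python
import math
import random


def acf_bound_exceedances(x, max_lag=40):
    """Return the bound 1.96/sqrt(n) and the lags 1..max_lag whose sample
    autocorrelation lies outside (-bound, +bound)."""
    n = len(x)
    xbar = sum(x) / n
    dev = [v - xbar for v in x]
    gamma0 = sum(d * d for d in dev) / n
    bound = 1.96 / math.sqrt(n)
    outside = []
    for h in range(1, max_lag + 1):
        gamma_h = sum(dev[t + h] * dev[t] for t in range(n - h)) / n
        if abs(gamma_h / gamma0) > bound:
            outside.append(h)
    return bound, outside


rng = random.Random(12345)                       # record the seed to reproduce the run
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
bound, outside = acf_bound_exceedances(noise)
print("bound:", round(bound, 4), "lags outside:", outside)
```

For iid noise one expects roughly two of the forty lags to be listed, although any single run may show more or fewer.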


Remark 5. The sample autocovariance and autocorrelation functions can be computed
for any data set $\{x_1, \ldots, x_n\}$ and are not restricted to observations from a
stationary time series. For data containing a trend, $|\hat{\rho}(h)|$ will exhibit slow decay as
$h$ increases, and for data with a substantial deterministic periodic component, $|\hat{\rho}(h)|$
will exhibit similar behavior with the same periodicity. (See the sample ACF of the
Australian red wine sales in Figure 1.14 and Problem 1.9.) Thus $\hat{\rho}(\cdot)$ can be useful
as an indicator of nonstationarity (see also Section 6.1).

Figure 1-14. The sample autocorrelation function for the Australian red wine sales showing the bounds $\pm 1.96/\sqrt{n}$.


1.4.2 A Model for the Lake Huron Data 


As noted earlier, an iid noise model for the residuals $\{y_1, \ldots, y_{98}\}$ obtained by fitting
a straight line to the Lake Huron data in Example 1.3.5 appears to be inappropriate.
This conclusion is confirmed by the sample ACF of the residuals (Figure 1.15), which
has three of the first forty values well outside the bounds $\pm 1.96/\sqrt{98}$.

The roughly geometric decay of the first few sample autocorrelations (with
$\hat{\rho}(h + 1)/\hat{\rho}(h) \approx 0.7$) suggests that an AR(1) series (with $\phi \approx 0.7$) might provide
a reasonable model for these residuals. (The form of the ACF for an AR(1)
process was computed in Example 1.4.5.)

To explore the appropriateness of such a model, consider the points $(y_1, y_2)$,
$(y_2, y_3), \ldots, (y_{97}, y_{98})$ plotted in Figure 1.16. The graph does indeed suggest a linear
relationship between $y_t$ and $y_{t-1}$. Using simple least squares estimation to fit a straight
line of the form $y_t = a y_{t-1}$, we obtain the model

$$Y_t = .791\,Y_{t-1} + Z_t, \qquad (1.4.3)$$

where $\{Z_t\}$ is iid noise with variance $\sum_{t=2}^{98} (y_t - .791\,y_{t-1})^2/97 = .5024$. The sample
ACF of the estimated noise sequence $z_t = y_t - .791\,y_{t-1}$, $t = 2, \ldots, 98$, is slightly
outside the bounds $\pm 1.96/\sqrt{97}$ at lag 1 ($\hat{\rho}(1) = .216$), but it is inside the bounds for
all other lags up to 40. This check that the estimated noise sequence is consistent with
the iid assumption of (1.4.3) reinforces our belief in the fitted model. More goodness
of fit tests for iid noise sequences are described in Section 1.6. The estimated noise
sequence $\{z_t\}$ in this example passes them all, providing further support for the model
(1.4.3).

Figure 1-15. The sample autocorrelation function for the Lake Huron residuals of Figure 1.10 showing the bounds $\pm 1.96/\sqrt{n}$.

Figure 1-16. Scatter plot of $(y_{t-1}, y_t)$, $t = 2, \ldots, 98$, for the data in Figure 1.10 showing the least squares regression line $y = .791x$.

A better fit to the residuals in equation (1.3.2) is provided by the second-order
autoregression

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \qquad (1.4.4)$$

where $\{Z_t\}$ is iid noise with variance $\sigma^2$. This is analogous to a linear model in
which $Y_t$ is regressed on the previous two values $Y_{t-1}$ and $Y_{t-2}$ of the time series. The
least squares estimates of the parameters $\phi_1$ and $\phi_2$, found by minimizing
$\sum_{t=3}^{98} (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})^2$, are $\hat{\phi}_1 = 1.002$ and $\hat{\phi}_2 = -.2834$. The estimate of $\sigma^2$ is
$\hat{\sigma}^2 = \sum_{t=3}^{98} (y_t - \hat{\phi}_1 y_{t-1} - \hat{\phi}_2 y_{t-2})^2/96 = .4460$, which is approximately 11% smaller than
the estimate of the noise variance for the AR(1) model (1.4.3). The improved fit is
indicated by the sample ACF of the estimated residuals, $y_t - \hat{\phi}_1 y_{t-1} - \hat{\phi}_2 y_{t-2}$, which
falls well within the bounds $\pm 1.96/\sqrt{96}$ for all lags up to 40.
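The least squares fit of a line through the origin that led to (1.4.3) has a simple closed form: the slope is $\sum y_t y_{t-1} / \sum y_{t-1}^2$. The sketch below applies it to a simulated series, since the Lake Huron residuals themselves are not reproduced here; the function name, the seed and the simulated input are all assumptions made for illustration.

```python
import random


def fit_ar1_through_origin(y):
    """Least squares fit of y_t = a * y_{t-1} (no intercept).

    Returns the slope, the residual variance estimate (sum of squared
    residuals divided by the number of residuals, n - 1) and the residuals.
    """
    pairs = [(y[t - 1], y[t]) for t in range(1, len(y))]
    a = sum(prev * cur for prev, cur in pairs) / sum(prev * prev for prev, _ in pairs)
    resid = [cur - a * prev for prev, cur in pairs]
    sigma2 = sum(r * r for r in resid) / len(resid)
    return a, sigma2, resid


# Simulated stand-in for the residual series (98 values, AR(1) with phi = 0.791).
rng = random.Random(0)
y = [0.0]
for _ in range(97):
    y.append(0.791 * y[-1] + rng.gauss(0.0, 0.5024 ** 0.5))

a_hat, sigma2_hat, z = fit_ar1_through_origin(y)
print("estimated slope:", round(a_hat, 3), "noise variance:", round(sigma2_hat, 3))
```

The residuals `z` play the role of the estimated noise sequence $\{z_t\}$, whose sample ACF would then be inspected against the bounds $\pm 1.96/\sqrt{97}$.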


1.5 Estimation and Elimination of Trend and Seasonal Components 


The first step in the analysis of any time series is to plot the data. If there are any 
apparent discontinuities in the series, such as a sudden change of level, it may be 
advisable to analyze the series by first breaking it into homogeneous segments. If 
there are outlying observations, they should be studied carefully to check whether 
there is any justification for discarding them (as for example if an observation has 
been incorrectly recorded). Inspection of a graph may also suggest the possibility 
of representing the data as a realization of the process (the classical decomposition 
model) 


$$X_t = m_t + s_t + Y_t, \qquad (1.5.1)$$

where $m_t$ is a slowly changing function known as a trend component, $s_t$ is a function
with known period $d$ referred to as a seasonal component, and $Y_t$ is a random noise
component that is stationary in the sense of Definition 1.4.2. If the seasonal and noise
fluctuations appear to increase with the level of the process, then a preliminary
transformation of the data is often used to make the transformed data more compatible
with the model (1.5.1). Compare, for example, the red wine sales in Figure 1.1 with
the transformed data, Figure 1.17, obtained by applying a logarithmic transformation.
The transformed data do not exhibit the increasing fluctuation with increasing level
that was apparent in the original data. This suggests that the model (1.5.1) is more
appropriate for the transformed than for the original series. In this section we shall
assume that the model (1.5.1) is appropriate (possibly after a preliminary
transformation of the data) and examine some techniques for estimating the components $m_t$,
$s_t$, and $Y_t$ in the model.

Figure 1-17. The natural logarithms of the red wine data.

Our aim is to estimate and extract the deterministic components $m_t$ and $s_t$ in
the hope that the residual or noise component $Y_t$ will turn out to be a stationary time
series. We can then use the theory of such processes to find a satisfactory probabilistic
model for the process $Y_t$, to analyze its properties, and to use it in conjunction with
$m_t$ and $s_t$ for purposes of prediction and simulation of $\{X_t\}$.

Another approach, developed extensively by Box and Jenkins (1976), is to apply
differencing operators repeatedly to the series $\{X_t\}$ until the differenced observations
resemble a realization of some stationary time series $\{W_t\}$. We can then use the theory
of stationary processes for the modeling, analysis, and prediction of $\{W_t\}$ and hence
of the original process. The various stages of this procedure will be discussed in detail
in Chapters 5 and 6.

The two approaches to trend and seasonality removal, (1) by estimation of $m_t$
and $s_t$ in (1.5.1) and (2) by differencing the series $\{X_t\}$, will now be illustrated with
reference to the data introduced in Section 1.1.


1.5.1 Estimation and Elimination of Trend in the Absence of Seasonality 


In the absence of a seasonal component the model (1.5.1) becomes the following. 


Nonseasonal Model with Trend: 


$$X_t = m_t + Y_t, \quad t = 1, \ldots, n, \qquad (1.5.2)$$

where $EY_t = 0$.

(If $EY_t \neq 0$, then we can replace $m_t$ and $Y_t$ in (1.5.2) with $m_t + EY_t$ and $Y_t - EY_t$,
respectively.)


Method 1: Trend Estimation 

Moving average and spectral smoothing are essentially nonparametric methods for 
trend (or signal) estimation and not for model building. Special smoothing filters can 
also be designed to remove periodic components as described under Method S1 below. 
The choice of smoothing filter requires a certain amount of subjective judgment, and 
it is recommended that a variety of filters be tried in order to get a good idea of the 
underlying trend. Exponential smoothing, since it is based on a moving average of 
past values only, is often used for forecasting, the smoothed value at the present time 
being used as the forecast of the next value. 

To construct a model for the data (with no seasonality) there are two general 
approaches, both available in ITSM. One is to fit a polynomial trend (by least squares) 
as described in Method 1(d) below, then to subtract the fitted trend from the data and 
to find an appropriate stationary time series model for the residuals. The other is 
to eliminate the trend by differencing as described in Method 2 and then to find an 
appropriate stationary model for the differenced series. The latter method has the 
advantage that it usually requires the estimation of fewer parameters and does not 
rest on the assumption of a trend that remains fixed throughout the observation period. 
The study of the residuals (or of the differenced series) is taken up in Section 1.6. 

(a) Smoothing with a finite moving average filter. Let $q$ be a nonnegative
integer and consider the two-sided moving average

$$W_t = (2q + 1)^{-1}\sum_{j=-q}^{q} X_{t-j} \qquad (1.5.3)$$

of the process $\{X_t\}$ defined by (1.5.2). Then for $q + 1 \leq t \leq n - q$,

$$W_t = (2q + 1)^{-1}\sum_{j=-q}^{q} m_{t-j} + (2q + 1)^{-1}\sum_{j=-q}^{q} Y_{t-j} \approx m_t, \qquad (1.5.4)$$

assuming that $m_t$ is approximately linear over the interval $[t - q, t + q]$ and that the
average of the error terms over this interval is close to zero (see Problem 1.11).
The moving average thus provides us with the estimates

$$\hat{m}_t = (2q + 1)^{-1}\sum_{j=-q}^{q} X_{t-j}, \quad q + 1 \leq t \leq n - q. \qquad (1.5.5)$$

Since $X_t$ is not observed for $t \leq 0$ or $t > n$, we cannot use (1.5.5) for $t \leq q$ or
$t > n - q$. The program ITSM deals with this problem by defining $X_t := X_1$ for
$t < 1$ and $X_t := X_n$ for $t > n$.


Example 1.5.1

The result of applying the moving-average filter (1.5.5) with $q = 2$ to the strike data of
Figure 1.6 is shown in Figure 1.18. The estimated noise terms $\hat{Y}_t = X_t - \hat{m}_t$ are shown
in Figure 1.19. As expected, they show no apparent trend. To apply this filter using
ITSM, open the project STRIKES.TSM, select Smooth>Moving Average, specify
2 for the filter order, and enter the weights 1,1,1 for Theta(0), Theta(1), and Theta(2)
(these are automatically normalized so that the sum of the weights is one). Then click
OK.

Figure 1-18. Simple 5-term moving average $\hat{m}_t$ of the strike data from Figure 1.6.

Figure 1-19. Residuals $\hat{Y}_t = X_t - \hat{m}_t$ after subtracting the 5-term moving average from the strike data.
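The two-sided moving average (1.5.5), together with the end-point convention the text attributes to ITSM ($X_t := X_1$ for $t < 1$ and $X_t := X_n$ for $t > n$), can be sketched in a few lines of Python. The function name is an illustrative choice, and the code is not ITSM's implementation.

```python
def moving_average_trend(x, q):
    """(2q+1)-term two-sided moving average of x.

    The series is extended by repeating the first value q times at the start
    and the last value q times at the end, so that an estimate is produced
    for every t = 1, ..., n.
    """
    n = len(x)
    padded = [x[0]] * q + list(x) + [x[-1]] * q
    width = 2 * q + 1
    # padded[t + q] is x[t], so padded[t : t + width] is centred on x[t].
    return [sum(padded[t:t + width]) / width for t in range(n)]


def residuals(x, trend):
    """Estimated noise terms: data minus estimated trend."""
    return [xt - mt for xt, mt in zip(x, trend)]
```

With `q = 2` this is the simple 5-term moving average of Example 1.5.1. A linear series passes through unchanged at interior points, whereas the padded end points are pulled toward the first and last observations.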


It is useful to think of $\{\hat{m}_t\}$ in (1.5.5) as a process obtained from $\{X_t\}$ by application
of a linear operator or linear filter $\hat{m}_t = \sum_{j=-\infty}^{\infty} a_j X_{t-j}$ with weights $a_j = (2q + 1)^{-1}$,
$-q \leq j \leq q$. This particular filter is a low-pass filter in the sense that it takes the
data $\{X_t\}$ and removes from it the rapidly fluctuating (or high frequency) component
$\{\hat{Y}_t\}$ to leave the slowly varying estimated trend term $\{\hat{m}_t\}$ (see Figure 1.20).

Figure 1-20. Smoothing with a low-pass linear filter: the input $\{x_t\}$ passes through the linear filter to give the output $\{\hat{m}_t = \sum_j a_j x_{t-j}\}$.

The particular filter (1.5.5) is only one of many that could be used for smoothing.
For large $q$, provided that $(2q + 1)^{-1}\sum_{j=-q}^{q} Y_{t-j} \approx 0$, it not only will attenuate
noise but at the same time will allow linear trend functions $m_t = c_0 + c_1 t$ to pass
without distortion (see Problem 1.11). However, we must beware of choosing $q$ to
be too large, since if $m_t$ is not linear, the filtered process, although smooth, will not
be a good estimate of $m_t$. By clever choice of the weights $\{a_j\}$ it is possible (see
Problems 1.12-1.14 and Section 4.3) to design a filter that will not only be effective
in attenuating noise in the data, but that will also allow a larger class of trend functions
(for example all polynomials of degree less than or equal to 3) to pass through without
distortion. The Spencer 15-point moving average is a filter that passes polynomials
of degree 3 without distortion. Its weights are

$$a_j = 0, \quad |j| > 7,$$

with

$$a_j = a_{-j}, \quad |j| \leq 7,$$

and

$$[a_0, a_1, \ldots, a_7] = \frac{1}{320}[74, 67, 46, 21, 3, -5, -6, -3]. \qquad (1.5.6)$$

Applied to the process (1.5.2) with $m_t = c_0 + c_1 t + c_2 t^2 + c_3 t^3$, it gives

$$\sum_{j=-7}^{7} a_j X_{t-j} = \sum_{j=-7}^{7} a_j m_{t-j} + \sum_{j=-7}^{7} a_j Y_{t-j} \approx \sum_{j=-7}^{7} a_j m_{t-j} = m_t,$$

where the last step depends on the assumed form of $m_t$ (Problem 1.12). Further details
regarding this and other smoothing filters can be found in Kendall and Stuart (1976),
Chapter 46.
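The claim that the Spencer filter passes cubic polynomials without distortion can be checked numerically. The sketch below builds the 15 weights from (1.5.6) with exact rational arithmetic and applies them to a cubic; the function names and the particular cubic are illustrative assumptions.

```python
from fractions import Fraction

# a_0, ..., a_7 of (1.5.6), before division by 320.
SPENCER_HALF = [74, 67, 46, 21, 3, -5, -6, -3]


def spencer_weights():
    """Return the 15 weights a_{-7}, ..., a_7 as exact fractions."""
    half = [Fraction(w, 320) for w in SPENCER_HALF]
    return half[:0:-1] + half        # mirror a_7..a_1, then a_0..a_7


def apply_filter(weights, x):
    """Compute sum_j a_j * x_{t-j} at the interior times where the whole
    window of weights fits inside the series."""
    q = len(weights) // 2
    return [
        sum(weights[j + q] * x[t - j] for j in range(-q, q + 1))
        for t in range(q, len(x) - q)
    ]


cubic = [t ** 3 - 2 * t ** 2 + 5 for t in range(30)]
smoothed = apply_filter(spencer_weights(), cubic)
print(smoothed == cubic[7:23])       # the cubic is reproduced exactly
```

Exact fractions are used so that the comparison is not blurred by floating-point rounding. With real data one would use ordinary floats.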
(b) Exponential smoothing. For any fixed $\alpha \in [0, 1]$, the one-sided moving
averages $\hat{m}_t$, $t = 1, \ldots, n$, defined by the recursions

$$\hat{m}_t = \alpha X_t + (1 - \alpha)\hat{m}_{t-1}, \quad t = 2, \ldots, n, \qquad (1.5.7)$$

and

$$\hat{m}_1 = X_1 \qquad (1.5.8)$$

can be computed using ITSM by selecting Smooth>Exponential and specifying
the value of $\alpha$. Application of (1.5.7) and (1.5.8) is often referred to as exponential
smoothing, since the recursions imply that for $t \geq 2$,
$\hat{m}_t = \sum_{j=0}^{t-2} \alpha(1 - \alpha)^j X_{t-j} + (1 - \alpha)^{t-1} X_1$, a weighted moving average of
$X_t, X_{t-1}, \ldots$, with weights decreasing exponentially (except for the last one).

Figure 1-21. Exponentially smoothed strike data with $\alpha = 0.4$.
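The recursions (1.5.7) and (1.5.8) and the equivalent weighted-sum form can be compared directly. The following Python sketch implements both; the function names are illustrative, and the automatic selection of $\alpha$ offered by ITSM is not reproduced.

```python
def exponential_smooth(x, alpha):
    """Exponentially smoothed values m_1, ..., m_n from (1.5.7) and (1.5.8)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    m = [x[0]]                                    # m_1 = X_1
    for t in range(1, len(x)):
        m.append(alpha * x[t] + (1.0 - alpha) * m[-1])
    return m


def exponential_smooth_closed_form(x, alpha, t):
    """m_t (1-based t) as the explicit weighted sum of X_t, X_{t-1}, ..., X_1."""
    weighted = sum(alpha * (1.0 - alpha) ** j * x[t - 1 - j] for j in range(t - 1))
    return weighted + (1.0 - alpha) ** (t - 1) * x[0]
```

For `x = [1, 2, 3, 4]` and `alpha = 0.5` the recursion gives 1, 1.5, 2.25, 3.125, and the closed form returns the same value 3.125 at `t = 4`.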

(c) Smoothing by elimination of high-frequency components. The option 
Smooth>FFT in the program ITSM allows us to smooth an arbitrary series by elimi- 
nation of the high-frequency components of its Fourier series expansion (see Section 
4.2). This option was used in Example 1.1.4, where we chose to retain the fraction 
f = .035 of the frequency components of the series in order to estimate the underlying 
signal. (The choice f = 1 would have left the series unchanged.) 


In Figures 1.21 and 1.22 we show the results of smoothing the strike data by
exponential smoothing with parameter $\alpha = 0.4$ (see (1.5.7)) and by high-frequency
elimination with $f = 0.4$, i.e., by eliminating a fraction 0.6 of the Fourier components
at the top of the frequency range. These should be compared with the simple
5-term moving average smoothing shown in Figure 1.18. Experimentation with different
smoothing parameters can easily be carried out using the program ITSM. The
exponentially smoothed value of the last observation is frequently used to forecast
the next data value. The program automatically selects an optimal value of $\alpha$ for this
purpose if $\alpha$ is specified as $-1$ in the exponential smoothing dialog box.


(d) Polynomial fitting. In Section 1.3.2 we showed how a trend of the form
$m_t = a_0 + a_1 t + a_2 t^2$ can be fitted to the data $\{x_1, \ldots, x_n\}$ by choosing the parameters
$a_0$, $a_1$, and $a_2$ to minimize the sum of squares, $\sum_{t=1}^{n} (x_t - m_t)^2$ (see Example 1.3.4).
The method of least squares estimation can also be used to estimate higher-order
polynomial trends in the same way. The Regression option of ITSM allows least
squares fitting of polynomial trends of order up to 10 (together with up to four harmonic
terms; see Example 1.3.6). It also allows generalized least squares estimation
(see Section 6.6), in which correlation between the residuals is taken into account.

Figure 1-22. Strike data smoothed by elimination of high frequencies with $f = 0.4$.


Method 2: Trend Elimination by Differencing 

Instead of attempting to remove the noise by smoothing as in Method 1, we now
attempt to eliminate the trend term by differencing. We define the lag-1 difference
operator $\nabla$ by

$$\nabla X_t = X_t - X_{t-1} = (1 - B)X_t, \qquad (1.5.9)$$

where $B$ is the backward shift operator,

$$BX_t = X_{t-1}. \qquad (1.5.10)$$

Powers of the operators $B$ and $\nabla$ are defined in the obvious way, i.e., $B^j(X_t) = X_{t-j}$
and $\nabla^j(X_t) = \nabla(\nabla^{j-1}(X_t))$, $j \geq 1$, with $\nabla^0(X_t) = X_t$. Polynomials in $B$ and $\nabla$ are
manipulated in precisely the same way as polynomial functions of real variables. For
example,

$$\begin{aligned}
\nabla^2 X_t &= \nabla(\nabla(X_t)) = (1 - B)(1 - B)X_t = (1 - 2B + B^2)X_t \\
&= X_t - 2X_{t-1} + X_{t-2}.
\end{aligned}$$

Figure 1-23. The twice-differenced series derived from the population data of Figure 1.5.

If the operator $\nabla$ is applied to a linear trend function $m_t = c_0 + c_1 t$, then we obtain the
constant function $\nabla m_t = m_t - m_{t-1} = c_0 + c_1 t - (c_0 + c_1(t - 1)) = c_1$. In the same
way any polynomial trend of degree $k$ can be reduced to a constant by application of
the operator $\nabla^k$ (Problem 1.10). For example, if $X_t = m_t + Y_t$, where $m_t = \sum_{j=0}^{k} c_j t^j$
and $Y_t$ is stationary with mean zero, application of $\nabla^k$ gives

$$\nabla^k X_t = k!\,c_k + \nabla^k Y_t,$$

a stationary process with mean $k!\,c_k$. These considerations suggest the possibility,
given any sequence $\{x_t\}$ of data, of applying the operator $\nabla$ repeatedly until we find
a sequence $\{\nabla^k x_t\}$ that can plausibly be modeled as a realization of a stationary
process. It is often found in practice that the order $k$ of differencing required is quite
small, frequently one or two. (This relies on the fact that many functions can be
well approximated, on an interval of finite length, by a polynomial of reasonably low
degree.)
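The reduction of a degree-$k$ polynomial to the constant $k!\,c_k$ is easy to confirm numerically. The sketch below implements the lag-1 difference operator and its powers on a plain list; the function names and the test polynomial are illustrative choices.

```python
import math


def difference(x):
    """Apply the lag-1 difference operator once: returns x_t - x_{t-1}.

    The result is one element shorter than the input, since the first
    observation has no predecessor.
    """
    return [x[t] - x[t - 1] for t in range(1, len(x))]


def difference_k(x, k):
    """Apply the lag-1 difference operator k times."""
    for _ in range(k):
        x = difference(x)
    return x


# Cubic trend with leading coefficient c_3 = 2 and no noise.
trend = [2 * t ** 3 - t + 7 for t in range(10)]
print(difference_k(trend, 3))        # every entry equals 3! * 2 = 12
print(math.factorial(3) * 2)
```

With noise present, the differenced series would fluctuate around $k!\,c_k$ rather than equal it exactly.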


Applying the operator $\nabla$ to the population values $\{x_t, t = 1, \ldots, 20\}$ of Figure 1.5, we
find that two differencing operations are sufficient to produce a series with no apparent
trend. (To carry out the differencing using ITSM, select Transform>Difference,
enter the value 1 for the differencing lag, and click OK.) This replaces the original
series $\{x_t\}$ by the once-differenced series $\{x_t - x_{t-1}\}$. Repetition of these steps gives
the twice-differenced series $\nabla^2 x_t = x_t - 2x_{t-1} + x_{t-2}$, plotted in Figure 1.23. Notice
that the magnitude of the fluctuations in $\nabla^2 x_t$ increases with the value of $x_t$. This effect
can be suppressed by first taking natural logarithms, $y_t = \ln x_t$, and then applying the
operator $\nabla^2$ to the series $\{y_t\}$. (See also Figures 1.1 and 1.17.)


1.5.2 Estimation and Elimination of Both Trend and Seasonality


The methods described for the estimation and elimination of trend can be adapted in
a natural way to eliminate both trend and seasonality in the general model, specified
as follows.


Classical Decomposition Model

$$X_t = m_t + s_t + Y_t, \quad t = 1, \ldots, n, \qquad (1.5.11)$$

where $EY_t = 0$, $s_{t+d} = s_t$, and $\sum_{j=1}^{d} s_j = 0$.


We shall illustrate these methods with reference to the accidental deaths data of 
Example 1.1.3, for which the period d of the seasonal component is clearly 12. 


Method S1: Estimation of Trend and Seasonal Components

The method we are about to describe is used in the Transform>Classical option
of ITSM.

Suppose we have observations $\{x_1, \ldots, x_n\}$. The trend is first estimated by applying
a moving average filter specially chosen to eliminate the seasonal component
and to dampen the noise. If the period $d$ is even, say $d = 2q$, then we use

$$\hat{m}_t = (0.5x_{t-q} + x_{t-q+1} + \cdots + x_{t+q-1} + 0.5x_{t+q})/d, \quad q < t \leq n - q. \qquad (1.5.12)$$

If the period is odd, say $d = 2q + 1$, then we use the simple moving average (1.5.5).

The second step is to estimate the seasonal component. For each $k = 1, \ldots, d$, we
compute the average $w_k$ of the deviations $\{(x_{k+jd} - \hat{m}_{k+jd}),\ q < k + jd \leq n - q\}$. Since
these average deviations do not necessarily sum to zero, we estimate the seasonal
component $s_k$ as

$$\hat{s}_k = w_k - d^{-1}\sum_{i=1}^{d} w_i, \quad k = 1, \ldots, d, \qquad (1.5.13)$$

and $\hat{s}_k = \hat{s}_{k-d}$, $k > d$.

The deseasonalized data is then defined to be the original series with the estimated
seasonal component removed, i.e.,

$$d_t = x_t - \hat{s}_t, \quad t = 1, \ldots, n. \qquad (1.5.14)$$

Finally, we reestimate the trend from the deseasonalized data $\{d_t\}$ using one of
the methods already described. The program ITSM allows you to fit a least squares
polynomial trend $\hat{m}$ to the deseasonalized series. In terms of this reestimated trend
and the estimated seasonal component, the estimated noise series is then given by

$$\hat{Y}_t = x_t - \hat{m}_t - \hat{s}_t, \quad t = 1, \ldots, n.$$

The reestimation of the trend is done in order to have a parametric form for the trend
that can be extrapolated for the purposes of prediction and simulation.

Figure 1-24. The deseasonalized accidental deaths data from ITSM.
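The first three steps of Method S1 (preliminary trend, seasonal estimate, deseasonalized series) can be sketched as follows. This is an illustrative Python rendering of (1.5.12) to (1.5.14) and not the ITSM code; the function names are assumptions, and the series is assumed long enough that every season has at least one time point inside the filter window. The final polynomial reestimation of the trend is omitted.

```python
def preliminary_trend(x, d):
    """Moving-average trend estimate of Method S1.

    Uses (1.5.12) when d is even and the simple moving average (1.5.5) when
    d is odd. Entries are None where the filter window does not fit.
    """
    n, q = len(x), d // 2
    m = [None] * n
    for t in range(q, n - q):                    # 0-based interior times
        if d % 2 == 0:
            total = 0.5 * x[t - q] + sum(x[t - q + 1:t + q]) + 0.5 * x[t + q]
        else:
            total = sum(x[t - q:t + q + 1])
        m[t] = total / d
    return m


def estimate_seasonal(x, d):
    """Seasonal estimates for t = 1, ..., n from (1.5.13), extended periodically."""
    m = preliminary_trend(x, d)
    w = []
    for k in range(d):
        devs = [x[t] - m[t] for t in range(k, len(x), d) if m[t] is not None]
        w.append(sum(devs) / len(devs))          # average deviation for season k
    w_bar = sum(w) / d
    s_hat = [wk - w_bar for wk in w]             # centred so the d values sum to zero
    return [s_hat[t % d] for t in range(len(x))]


def deseasonalize(x, d):
    """Deseasonalized data d_t = x_t - s_hat_t of (1.5.14)."""
    return [xt - st for xt, st in zip(x, estimate_seasonal(x, d))]
```

On a noise-free series made of a linear trend plus a period-4 pattern summing to zero, `estimate_seasonal` recovers the pattern and `deseasonalize` returns the linear trend.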


Figure 1.24 shows the deseasonalized accidental deaths data obtained from ITSM
by reading in the series DEATHS.TSM, selecting Transform>Classical, checking
only the box marked Seasonal Fit, entering 12 for the period, and clicking
OK. The estimated seasonal component $\hat{s}_t$, shown in Figure 1.25, is obtained by
selecting Transform>Show Classical Fit. (Except for having a mean of zero, this
estimate is very similar to the harmonic regression function with frequencies $2\pi/12$
and $2\pi/6$ displayed in Figure 1.11.) The graph of the deseasonalized data suggests
the presence of an additional quadratic trend function. In order to fit such a trend to
the deseasonalized data, select Transform>Undo Classical to retrieve the original
data and then select Transform>Classical and check the boxes marked Seasonal
Fit and Polynomial Trend, entering 12 for the period and selecting Quadratic
for the trend. Then click OK and you will obtain the trend function

$$\hat{m}_t = 9952 - 71.82t + 0.8260t^2, \quad 1 \leq t \leq 72.$$

At this point the data stored in ITSM consists of the estimated noise

$$\hat{Y}_t = x_t - \hat{m}_t - \hat{s}_t, \quad t = 1, \ldots, 72,$$

obtained by subtracting the estimated seasonal and trend components from the original
data.


Method S2: Elimination of Trend and Seasonal Components by Differencing

The technique of differencing that we applied earlier to nonseasonal data can be
adapted to deal with seasonality of period $d$ by introducing the lag-$d$ differencing
operator $\nabla_d$ defined by

$$\nabla_d X_t = X_t - X_{t-d} = (1 - B^d)X_t. \qquad (1.5.15)$$

(This operator should not be confused with the operator $\nabla^d = (1 - B)^d$ defined
earlier.)

Applying the operator $\nabla_d$ to the model

$$X_t = m_t + s_t + Y_t,$$

where $\{s_t\}$ has period $d$, we obtain

$$\nabla_d X_t = m_t - m_{t-d} + Y_t - Y_{t-d},$$

which gives a decomposition of the difference $\nabla_d X_t$ into a trend component
($m_t - m_{t-d}$) and a noise term ($Y_t - Y_{t-d}$). The trend, $m_t - m_{t-d}$, can then be eliminated using
the methods already described, in particular by applying a power of the operator $\nabla$.


Example 1.5.5

Figure 1.26 shows the result of applying the operator $\nabla_{12}$ to the accidental deaths
data. The graph is obtained from ITSM by opening DEATHS.TSM, selecting
Transform>Difference, entering lag 12, and clicking OK. The seasonal component evident
in Figure 1.3 is absent from the graph of $\nabla_{12}x_t$, $13 \leq t \leq 72$. However, there still
appears to be a nondecreasing trend. If we now apply the operator $\nabla$ to $\{\nabla_{12}x_t\}$ by
again selecting Transform>Difference, this time with lag one, we obtain the graph
of $\nabla\nabla_{12}x_t$, $14 \leq t \leq 72$, shown in Figure 1.27, which has no apparent trend or
seasonal component. In Chapter 5 we shall show that this doubly differenced series can
in fact be well represented by a stationary time series model.

Figure 1-26. The differenced series $\{\nabla_{12}x_t, t = 13, \ldots, 72\}$ derived from the monthly accidental deaths $\{x_t, t = 1, \ldots, 72\}$.

Figure 1-27. The differenced series $\{\nabla\nabla_{12}x_t, t = 14, \ldots, 72\}$ derived from the monthly accidental deaths $\{x_t, t = 1, \ldots, 72\}$.
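The effect of $\nabla_{12}$ followed by $\nabla$ can be illustrated on an artificial monthly series. The accidental deaths data are not reproduced here, so the sketch below uses a made-up linear trend plus a fixed period-12 pattern; the function name and the input values are assumptions for illustration only.

```python
def lag_difference(x, d):
    """Apply the lag-d difference operator: returns x_t - x_{t-d}.

    The result is d elements shorter than the input.
    """
    return [x[t] - x[t - d] for t in range(d, len(x))]


# Artificial 72-month series: linear trend 3t plus a period-12 seasonal pattern.
season = [5, 3, 0, -2, -4, -6, -5, -1, 2, 4, 3, 1]
x = [3 * t + season[t % 12] for t in range(72)]

step1 = lag_difference(x, 12)        # removes the seasonal pattern
step2 = lag_difference(step1, 1)     # removes the remaining constant level
print(len(step1), set(step1))        # 60 values, all equal to 3 * 12 = 36
print(len(step2), set(step2))        # 59 values, all equal to 0
```

The lengths 60 and 59 match the index ranges $13 \leq t \leq 72$ and $14 \leq t \leq 72$ quoted in the example.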


In this section we have discussed a variety of methods for estimating and/or
removing trend and seasonality. The particular method chosen for any given data
set will depend on a number of factors including whether or not estimates of the
components of the series are required and whether or not it appears that the data
contain a seasonal component that does not vary with time. The program ITSM
allows two options under the Transform menu:


1. “classical decomposition,” in which trend and/or seasonal components are esti- 
mated and subtracted from the data to generate a noise sequence, and 

2. “differencing,” in which trend and/or seasonal components are removed from the 
data by repeated differencing at one or more lags in order to generate a noise 
sequence. 


A third option is to use the Regression menu, possibly after applying a Box—Cox 
transformation. Using this option we can (see Example 1.3.6) 


3. fit a sum of harmonics and a polynomial trend to generate a noise sequence that 
consists of the residuals from the regression. 


In the next section we shall examine some techniques for deciding whether or not the 
noise sequence so generated differs significantly from iid noise. If the noise sequence 
does have sample autocorrelations significantly different from zero, then we can take 
advantage of this serial dependence to forecast future noise values in terms of past 
values by modeling the noise as a stationary time series. 


1.6 Testing the Estimated Noise Sequence 


The objective of the data transformations described in Section 1.5 is to produce a 
series with no apparent deviations from stationarity, and in particular with no apparent 
trend or seasonality. Assuming that this has been done, the next step is to model the 
estimated noise sequence (i.e., the residuals obtained either by differencing the data 
or by estimating and subtracting the trend and seasonal components). If there is no
dependence among these residuals, then we can regard them as observations
of independent random variables, and there is no further modeling to be done except to 
estimate their mean and variance. However, if there is significant dependence among 
the residuals, then we need to look for a more complex stationary time series model 
for the noise that accounts for the dependence. This will be to our advantage, since 
dependence means in particular that past observations of the noise sequence can assist 
in predicting future values. 

In this section we examine some simple tests for checking the hypothesis that 
the residuals from Section 1.5 are observed values of independent and identically 
distributed random variables. If they are, then our work is done. If not, then we must 
use the theory of stationary processes to be developed in later chapters to find a more 
appropriate model. 

(a) The sample autocorrelation function. For large $n$, the sample autocorrelations
of an iid sequence $Y_1, \ldots, Y_n$ with finite variance are approximately iid with
distribution $N(0, 1/n)$ (see TSTM p. 222). Hence, if $y_1, \ldots, y_n$ is a realization of
such an iid sequence, about 95% of the sample autocorrelations should fall between
the bounds $\pm 1.96/\sqrt{n}$. If we compute the sample autocorrelations up to lag 40 and
find that more than two or three values fall outside the bounds, or that one value falls
far outside the bounds, we therefore reject the iid hypothesis. The bounds $\pm 1.96/\sqrt{n}$
are automatically plotted when the sample autocorrelation function is computed by
the program ITSM.
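
The check in (a) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ITSM implementation: the function names `sample_acf` and `count_outside_bounds` are hypothetical, the sample autocovariances use the divisor $n$, and the simulated Gaussian series merely stands in for a real residual series.

```python
import math
import random


def sample_acf(y, max_lag):
    """Return [rho_hat(0), ..., rho_hat(max_lag)] using divisor-n autocovariances."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    gamma0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t + h] * dev[t] for t in range(n - h)) / n / gamma0
        for h in range(max_lag + 1)
    ]


def count_outside_bounds(y, max_lag=40):
    """Count lags 1..max_lag whose sample autocorrelation exceeds 1.96/sqrt(n) in absolute value."""
    bound = 1.96 / math.sqrt(len(y))
    return sum(1 for r in sample_acf(y, max_lag)[1:] if abs(r) > bound)


# Hypothetical data: 200 simulated iid N(0, 1) values standing in for residuals.
random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(200)]
print("lags outside the bounds:", count_outside_bounds(residuals))
```

For an iid series one expects roughly 5% of the 40 lags, i.e., about two, to fall outside the bounds; a markedly larger count is the informal signal for rejection described above.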

(b) The portmanteau test. Instead of checking to see whether each sample
autocorrelation $\hat\rho(j)$ falls inside the bounds defined in (a) above, it is also possible
to consider the single statistic

$$Q = n \sum_{j=1}^{h} \hat\rho^2(j).$$

If $Y_1, \ldots, Y_n$ is a finite-variance iid sequence, then by the same result used in (a), $Q$ is
approximately distributed as the sum of squares of the independent $N(0, 1)$ random
variables $\sqrt{n}\,\hat\rho(j)$, $j = 1, \ldots, h$, i.e., as chi-squared with $h$ degrees of freedom. A
large value of $Q$ suggests that the sample autocorrelations of the data are too large for
the data to be a sample from an iid sequence. We therefore reject the iid hypothesis
at level $\alpha$ if $Q > \chi^2_{1-\alpha}(h)$, where $\chi^2_{1-\alpha}(h)$ is the $1 - \alpha$ quantile of the chi-squared
distribution with $h$ degrees of freedom. The program ITSM conducts a refinement of
this test, formulated by Ljung and Box (1978), in which $Q$ is replaced by

$$Q_{LB} = n(n+2) \sum_{j=1}^{h} \hat\rho^2(j)/(n-j),$$

whose distribution is better approximated by the chi-squared distribution with $h$
degrees of freedom.

Another portmanteau test, formulated by McLeod and Li (1983), can be used as
a further test for the iid hypothesis, since if the data are iid, then the squared data are
also iid. It is based on the same statistic used for the Ljung-Box test, except that the
sample autocorrelations of the data are replaced by the sample autocorrelations of
the squared data, $\hat\rho_{WW}(h)$, giving

$$Q_{ML} = n(n+2) \sum_{k=1}^{h} \hat\rho_{WW}^2(k)/(n-k).$$

The hypothesis of iid data is then rejected at level $\alpha$ if the observed value of $Q_{ML}$ is
larger than the $1 - \alpha$ quantile of the $\chi^2(h)$ distribution.
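
The two portmanteau statistics can be computed directly from their definitions. The sketch below is illustrative only: `ljung_box` and `mcleod_li` are hypothetical helper names, the data are simulated, and the critical value 31.41 is the tabulated 0.95 quantile of the chi-squared distribution with 20 degrees of freedom (hard-coded here to keep the example free of third-party libraries).

```python
import random


def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag (divisor-n autocovariances)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    gamma0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t + h] * dev[t] for t in range(n - h)) / n / gamma0
        for h in range(max_lag + 1)
    ]


def ljung_box(y, h):
    """Q_LB = n(n+2) * sum_{j=1}^{h} rho_hat(j)^2 / (n - j)."""
    n = len(y)
    rho = sample_acf(y, h)
    return n * (n + 2) * sum(rho[j] ** 2 / (n - j) for j in range(1, h + 1))


def mcleod_li(y, h):
    """Same statistic, applied to the sample autocorrelations of the squared data."""
    return ljung_box([v * v for v in y], h)


# Hypothetical data: simulated iid N(0, 1) values.
random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
CHI2_95_DF20 = 31.41  # tabulated 0.95 quantile of chi-squared with 20 degrees of freedom
for name, stat in (("Ljung-Box", ljung_box(data, 20)), ("McLeod-Li", mcleod_li(data, 20))):
    verdict = "reject" if stat > CHI2_95_DF20 else "do not reject"
    print(f"{name}: {stat:.2f} -> {verdict} iid at level .05")
```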

(c) The turning point test. If $y_1, \ldots, y_n$ is a sequence of observations, we say
that there is a turning point at time $i$, $1 < i < n$, if $y_{i-1} < y_i$ and $y_i > y_{i+1}$ or if
$y_{i-1} > y_i$ and $y_i < y_{i+1}$. If $T$ is the number of turning points of an iid sequence of
length $n$, then, since the probability of a turning point at time $i$ is $\frac{2}{3}$, the expected
value of $T$ is

$$\mu_T = E(T) = 2(n-2)/3.$$

It can also be shown for an iid sequence that the variance of $T$ is

$$\sigma_T^2 = \mathrm{Var}(T) = (16n - 29)/90.$$

A large value of $T - \mu_T$ indicates that the series is fluctuating more rapidly than
expected for an iid sequence. On the other hand, a value of $T - \mu_T$ much smaller
than zero indicates a positive correlation between neighboring observations. For an
iid sequence with $n$ large, it can be shown that

$$T \text{ is approximately } N(\mu_T, \sigma_T^2).$$

This means we can carry out a test of the iid hypothesis, rejecting it at level $\alpha$ if
$|T - \mu_T|/\sigma_T > \Phi_{1-\alpha/2}$, where $\Phi_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal
distribution. (A commonly used value of $\alpha$ is .05, for which the corresponding value
of $\Phi_{1-\alpha/2}$ is 1.96.)
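
A minimal sketch of the turning point test, assuming the normal approximation above; the function name `turning_point_test` and the short example series are hypothetical.

```python
import math
from statistics import NormalDist


def turning_point_test(y):
    """Return (T, mu_T, var_T, z, two-sided p-value) for the turning point test."""
    n = len(y)
    t = sum(
        1
        for i in range(1, n - 1)
        if (y[i - 1] < y[i] and y[i] > y[i + 1]) or (y[i - 1] > y[i] and y[i] < y[i + 1])
    )
    mu = 2 * (n - 2) / 3
    var = (16 * n - 29) / 90
    z = abs(t - mu) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(z))
    return t, mu, var, z, p_value


# Hypothetical series that zigzags at every interior point.
t, mu, var, z, p_value = turning_point_test([1, 3, 2, 4, 3, 5])
print(t, round(mu, 3), round(var, 3))
```

The normal approximation is only meaningful for large $n$; the six-point series is used here purely to make the counting rule easy to follow by hand.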

(d) The difference-sign test. For this test we count the number $S$ of values of $i$
such that $y_i > y_{i-1}$, $i = 2, \ldots, n$, or equivalently the number of times the differenced
series $y_i - y_{i-1}$ is positive. For an iid sequence it is clear that

$$\mu_S = ES = \tfrac{1}{2}(n-1).$$

It can also be shown, under the same assumption, that

$$\sigma_S^2 = \mathrm{Var}(S) = (n+1)/12,$$

and that for large $n$,

$$S \text{ is approximately } N(\mu_S, \sigma_S^2).$$

A large positive (or negative) value of $S - \mu_S$ indicates the presence of an increasing
(or decreasing) trend in the data. We therefore reject the assumption of no trend in
the data if $|S - \mu_S|/\sigma_S > \Phi_{1-\alpha/2}$.

The difference-sign test must be used with caution. A set of observations exhibit- 
ing a strong cyclic component will pass the difference-sign test for randomness, since 
roughly half of the observations will be points of increase. 
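
The caution about cyclic data is easy to see numerically. In the sketch below (hypothetical function name and artificial series) a pure linear trend is strongly rejected, while a pure cycle of period 12 passes because about half of its steps are increases.

```python
import math
from statistics import NormalDist


def difference_sign_test(y):
    """Return (S, z, two-sided p-value) for the difference-sign test."""
    n = len(y)
    s = sum(1 for i in range(1, n) if y[i] > y[i - 1])
    mu = (n - 1) / 2
    var = (n + 1) / 12
    z = abs(s - mu) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(z))
    return s, z, p_value


# Hypothetical examples: a linear trend and a deterministic cycle of period 12.
trend = [0.5 * t for t in range(120)]
cycle = [math.sin(2 * math.pi * t / 12) for t in range(120)]
for name, series in (("trend", trend), ("cycle", cycle)):
    s, z, p_value = difference_sign_test(series)
    print(f"{name}: S = {s}, z = {z:.2f}, p-value = {p_value:.3f}")
```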

(e) The rank test. The rank test is particularly useful for detecting a linear trend
in the data. Define $P$ to be the number of pairs $(i, j)$ such that $y_j > y_i$ and $j > i$,
$i = 1, \ldots, n-1$. There is a total of $\binom{n}{2} = \frac{1}{2}n(n-1)$ pairs $(i, j)$ such that $j > i$. For
an iid sequence $\{Y_1, \ldots, Y_n\}$, each event $\{Y_j > Y_i\}$ has probability $\frac{1}{2}$, and the mean
of $P$ is therefore

$$\mu_P = \tfrac{1}{4}n(n-1).$$

It can also be shown for an iid sequence that the variance of $P$ is

$$\sigma_P^2 = n(n-1)(2n+5)/72$$

and that for large $n$,

$$P \text{ is approximately } N(\mu_P, \sigma_P^2)$$

(see Kendall and Stuart, 1976). A large positive (negative) value of $P - \mu_P$ indicates
the presence of an increasing (decreasing) trend in the data. The assumption that
$\{y_j\}$ is a sample from an iid sequence is therefore rejected at level $\alpha = 0.05$ if
$|P - \mu_P|/\sigma_P > \Phi_{1-\alpha/2} = 1.96$.
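
The rank statistic can be computed by direct enumeration of the pairs. This is a sketch with a hypothetical function name; the double loop costs $O(n^2)$ operations, which is adequate for series of a few thousand points.

```python
import math
from statistics import NormalDist


def rank_test(y):
    """Return (P, mu_P, var_P, z, two-sided p-value) for the rank test."""
    n = len(y)
    # P counts the pairs (i, j) with j > i and y_j > y_i.
    p = sum(1 for i in range(n - 1) for j in range(i + 1, n) if y[j] > y[i])
    mu = n * (n - 1) / 4
    var = n * (n - 1) * (2 * n + 5) / 72
    z = abs(p - mu) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(z))
    return p, mu, var, z, p_value


# Hypothetical strictly increasing series: every pair is counted.
p, mu, var, z, p_value = rank_test([1, 2, 3, 4])
print(p, mu, round(var, 4))
```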

(f) Fitting an autoregressive model. A further test that can be carried out using 
the program ITSM is to fit an autoregressive model to the data using the Yule-Walker 
algorithm (discussed in Section 5.1.1) and choosing the order which minimizes the 
AICC statistic (see Section 5.5). A selected order equal to zero suggests that the data 
is white noise. 

(g) Checking for normality. If the noise process is Gaussian, i.e., if all of its 
joint distributions are normal, then stronger conclusions can be drawn when a model 
is fitted to the data. The following test enables us to check whether it is reasonable 
to assume that observations from an iid sequence are also Gaussian. 

Let $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n)}$ be the order statistics of a random sample $Y_1, \ldots, Y_n$
from the distribution $N(\mu, \sigma^2)$. If $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ are the order statistics
from a $N(0, 1)$ sample of size $n$, then

$$EY_{(j)} = \mu + \sigma m_j,$$

where $m_j = EX_{(j)}$, $j = 1, \ldots, n$. The graph of the points $(m_1, Y_{(1)}), \ldots, (m_n, Y_{(n)})$
is called a Gaussian qq plot and can be displayed in ITSM by clicking on the yellow
button labeled QQ. If the normal assumption is correct, the Gaussian qq plot should be
approximately linear. Consequently, the squared correlation of the points $(m_i, Y_{(i)})$,
$i = 1, \ldots, n$, should be near 1. The assumption of normality is therefore rejected if the
squared correlation $R^2$ is sufficiently small. If we approximate $m_i$ by $\Phi^{-1}((i - .5)/n)$
(see Mage, 1982 for some alternative approximations), then $R^2$ reduces to

$$R^2 = \frac{\left(\sum_{i=1}^{n} (Y_{(i)} - \bar{Y})\,\Phi^{-1}\!\left(\frac{i - .5}{n}\right)\right)^2}{\sum_{i=1}^{n} (Y_{(i)} - \bar{Y})^2 \,\sum_{i=1}^{n} \left(\Phi^{-1}\!\left(\frac{i - .5}{n}\right)\right)^2},$$

where $\bar{Y} = n^{-1}(Y_1 + \cdots + Y_n)$. Percentage points for the distribution of $R^2$, assuming
normality of the sample values, are given by Shapiro and Francia (1972) for sample
sizes $n < 100$. For $n = 200$, $P(R^2 < .987) = .05$ and $P(R^2 < .989) = .10$. For
larger values of $n$ the Jarque-Bera test for normality can be used (see Section 5.3.3).
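
The statistic $R^2$ is straightforward to evaluate with the approximation $m_i \approx \Phi^{-1}((i - .5)/n)$. The following sketch uses only the Python standard library; the function name `qq_r_squared` and the simulated sample are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def qq_r_squared(y):
    """Squared correlation of the Gaussian qq plot, with m_i ~ Phi^{-1}((i - .5)/n)."""
    n = len(y)
    ordered = sorted(y)
    m = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    ybar = fmean(ordered)
    numerator = sum((yi - ybar) * mi for yi, mi in zip(ordered, m)) ** 2
    denominator = sum((yi - ybar) ** 2 for yi in ordered) * sum(mi * mi for mi in m)
    return numerator / denominator


# Hypothetical sample: 200 simulated N(10, 4) observations.
random.seed(2)
sample = [random.gauss(10.0, 2.0) for _ in range(200)]
print(f"R^2 = {qq_r_squared(sample):.4f}")
```

With $n = 200$ the value would be compared with the percentage points quoted above, so that normality is rejected at level .05 when $R^2 < .987$.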


If we did not know in advance how the signal plus noise data of Example 1.1.4 were 
generated, we might suspect that they came from an iid sequence. We can check this 
hypothesis with the aid of the tests (a)-(f) introduced above. 

[Figure 1-28: The sample autocorrelation function (ACF) for the data of Example 1.1.4, showing the bounds $\pm 1.96/\sqrt{n}$.]


(a) The sample autocorrelation function (Figure 1.28) is obtained from ITSM by 
opening the project SIGNAL.TSM and clicking on the second yellow button at the 
top of the ITSM window. Observing that 25% of the autocorrelations are outside the 
bounds $\pm 1.96/\sqrt{200}$, we reject the hypothesis that the series is iid. 


The remaining tests (b), (c), (d), (e), and (f) are performed by choosing the option 
Statistics>Residual Analysis>Tests of Randomness. (Since no model has 
been fitted to the data, the residuals are the same as the data themselves.) 


(b) The sample value of the Ljung-Box statistic $Q_{LB}$ with $h = 20$ is 51.84. Since
the corresponding p-value (displayed by ITSM) is .00012 < .05, we reject the iid
hypothesis at level .05. The p-value for the McLeod-Li statistic $Q_{ML}$ is 0.717. The
McLeod-Li statistic therefore does not provide sufficient evidence to reject the iid
hypothesis at level .05.

(c) The sample value of the turning-point statistic $T$ is 138, and the asymptotic
distribution under the iid hypothesis (with sample size $n = 200$) is $N(132, 35.3)$. Thus
$|T - \mu_T|/\sigma_T = 1.01$, corresponding to a computed p-value of .312. On the basis of
the value of $T$ there is therefore not sufficient evidence to reject the iid hypothesis at
level .05.

(d) The sample value of the difference-sign statistic $S$ is 101, and the asymptotic
distribution under the iid hypothesis (with sample size $n = 200$) is $N(99.5, 16.7)$.
Thus $|S - \mu_S|/\sigma_S = 0.38$, corresponding to a computed p-value of 0.714. On the basis
of the value of $S$ there is therefore not sufficient evidence to reject the iid hypothesis
at level .05.

(e) The sample value of the rank statistic $P$ is 10310, and the asymptotic distribution
under the iid hypothesis (with $n = 200$) is $N(9950, 2.239 \times 10^5)$. Thus
$|P - \mu_P|/\sigma_P = 0.76$, corresponding to a computed p-value of 0.447. On the basis of
the value of $P$ there is therefore not sufficient evidence to reject the iid hypothesis at
level .05.

(f) The minimum-AICC Yule-Walker autoregressive model for the data is of
order seven, supporting the evidence provided by the sample ACF and Ljung-Box
tests against the iid hypothesis.

Thus, although not all of the tests detect significant deviation from iid behavior,
the sample autocorrelation, the Ljung-Box statistic, and the fitted autoregression provide
strong evidence against it, causing us to reject it (correctly) in this example.
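
The standardized values and p-values quoted in (c), (d), and (e) follow from the formulas of this section with $n = 200$. The short script below redoes that arithmetic from the observed statistics reported above; the dictionary layout and the helper `standardize` are illustrative choices, and small differences in the last digit can arise from rounding.

```python
import math
from statistics import NormalDist

n = 200
# name: (observed statistic, mean under the iid hypothesis, variance under the iid hypothesis)
cases = {
    "turning point T": (138, 2 * (n - 2) / 3, (16 * n - 29) / 90),
    "difference sign S": (101, (n - 1) / 2, (n + 1) / 12),
    "rank P": (10310, n * (n - 1) / 4, n * (n - 1) * (2 * n + 5) / 72),
}


def standardize(observed, mean, variance):
    """Return the absolute standardized statistic and its two-sided normal p-value."""
    z = abs(observed - mean) / math.sqrt(variance)
    return z, 2 * (1 - NormalDist().cdf(z))


results = {name: standardize(*values) for name, values in cases.items()}
for name, (z, p_value) in results.items():
    print(f"{name}: z = {z:.2f}, p-value = {p_value:.3f}")
```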


The general strategy in applying the tests described in this section is to check 
them all and to proceed with caution if any of them suggests a serious deviation 
from the iid hypothesis. (Remember that as you increase the number of tests, the 
probability that at least one rejects the null hypothesis when it is true increases. You 
should therefore not necessarily reject the null hypothesis on the basis of one test 
result only.) 


1.1. Let $X$ and $Y$ be two random variables with $E(Y) = \mu$ and $EY^2 < \infty$.
a. Show that the constant $c$ that minimizes $E(Y - c)^2$ is $c = \mu$.
b. Deduce that the random variable $f(X)$ that minimizes $E[(Y - f(X))^2 \mid X]$ is
$f(X) = E[Y \mid X]$.
c. Deduce that the random variable $f(X)$ that minimizes $E(Y - f(X))^2$ is also
$f(X) = E[Y \mid X]$.


1.2. (Generalization of Problem 1.1.) Suppose that $X_1, X_2, \ldots$ is a sequence of random
variables with $E(X_t^2) < \infty$ and $E(X_t) = \mu$.
a. Show that the random variable $f(X_1, \ldots, X_n)$ that minimizes
$E[(X_{n+1} - f(X_1, \ldots, X_n))^2 \mid X_1, \ldots, X_n]$ is
$f(X_1, \ldots, X_n) = E[X_{n+1} \mid X_1, \ldots, X_n]$.
b. Deduce that the random variable $f(X_1, \ldots, X_n)$ that minimizes
$E[(X_{n+1} - f(X_1, \ldots, X_n))^2]$ is also
$f(X_1, \ldots, X_n) = E[X_{n+1} \mid X_1, \ldots, X_n]$.
c. If $X_1, X_2, \ldots$ is iid with $E(X_i^2) < \infty$ and $EX_i = \mu$, where $\mu$ is known, what
is the minimum mean squared error predictor of $X_{n+1}$ in terms of $X_1, \ldots, X_n$?
d. Under the conditions of part (c) show that the best linear unbiased estimator
of $\mu$ in terms of $X_1, \ldots, X_n$ is $\bar{X} = \frac{1}{n}(X_1 + \cdots + X_n)$. ($\hat\mu$ is said to be an
unbiased estimator of $\mu$ if $E\hat\mu = \mu$ for all $\mu$.)
e. Under the conditions of part (c) show that $\bar{X}$ is the best linear predictor of
$X_{n+1}$ that is unbiased for $\mu$.
f. If $X_1, X_2, \ldots$ is iid with $E(X_i^2) < \infty$ and $EX_i = \mu$, and if $S_0 = 0$,
$S_n = X_1 + \cdots + X_n$, $n = 1, 2, \ldots$, what is the minimum mean squared error
predictor of $S_{n+1}$ in terms of $S_1, \ldots, S_n$?


1.3. Show that a strictly stationary process with $E(X_i^2) < \infty$ is weakly stationary.


1.4. Let $\{Z_t\}$ be a sequence of independent normal random variables, each with
mean 0 and variance $\sigma^2$, and let $a$, $b$, and $c$ be constants. Which, if any, of
the following processes are stationary? For each stationary process specify the
mean and autocovariance function.
a. $X_t = a + bZ_t + cZ_{t-2}$
b. $X_t = Z_1 \cos(ct) + Z_2 \sin(ct)$
c. $X_t = Z_t \cos(ct) + Z_{t-1} \sin(ct)$
d. $X_t = a + bZ_0$
e. $X_t = Z_0 \cos(ct)$
f. $X_t = Z_t Z_{t-1}$


1.5. Let $\{X_t\}$ be the moving-average process of order 2 given by

$$X_t = Z_t + \theta Z_{t-2},$$

where $\{Z_t\}$ is WN(0, 1).
a. Find the autocovariance and autocorrelation functions for this process when
$\theta = .8$.
b. Compute the variance of the sample mean $(X_1 + X_2 + X_3 + X_4)/4$ when
$\theta = .8$.
c. Repeat (b) when $\theta = -.8$ and compare your answer with the result obtained
in (b).


1.6. Let $\{X_t\}$ be the AR(1) process defined in Example 1.4.5.
a. Compute the variance of the sample mean $(X_1 + X_2 + X_3 + X_4)/4$ when
$\phi = .9$ and $\sigma^2 = 1$.
b. Repeat (a) when $\phi = -.9$ and compare your answer with the result obtained
in (a).


1.7. If $\{X_t\}$ and $\{Y_t\}$ are uncorrelated stationary sequences, i.e., if $X_r$ and $Y_s$ are
uncorrelated for every $r$ and $s$, show that $\{X_t + Y_t\}$ is stationary with autocovariance
function equal to the sum of the autocovariance functions of $\{X_t\}$ and $\{Y_t\}$.


1.8. Let $\{Z_t\}$ be IID N(0, 1) noise and define

$$X_t = \begin{cases} Z_t, & \text{if } t \text{ is even}, \\ (Z_{t-1}^2 - 1)/\sqrt{2}, & \text{if } t \text{ is odd}. \end{cases}$$

a. Show that $\{X_t\}$ is WN(0, 1) but not iid(0, 1) noise.
b. Find $E(X_{n+1} \mid X_1, \ldots, X_n)$ for $n$ odd and $n$ even and compare the results.


1.9. Let $\{x_1, \ldots, x_n\}$ be observed values of a time series at times $1, \ldots, n$, and let
$\hat\rho(h)$ be the sample ACF at lag $h$ as in Definition 1.4.4.
a. If $x_t = a + bt$, where $a$ and $b$ are constants and $b \neq 0$, show that for each
fixed $h \geq 1$,

$$\hat\rho(h) \to 1 \text{ as } n \to \infty.$$

b. If $x_t = c\cos(\omega t)$, where $c$ and $\omega$ are constants ($c \neq 0$ and $\omega \in (-\pi, \pi]$),
show that for each fixed $h$,

$$\hat\rho(h) \to \cos(\omega h) \text{ as } n \to \infty.$$


1.10. If $m_t = \sum_{k=0}^{p} c_k t^k$, $t = 0, \pm 1, \ldots$, show that $\nabla m_t$ is a polynomial of degree
$p - 1$ in $t$ and hence that $\nabla^{p+1} m_t = 0$.


1.11. Consider the simple moving-average filter with weights $a_j = (2q+1)^{-1}$,
$-q \leq j \leq q$.
a. If $m_t = c_0 + c_1 t$, show that $\sum_{j=-q}^{q} a_j m_{t-j} = m_t$.
b. If $Z_t$, $t = 0, \pm 1, \pm 2, \ldots$, are independent random variables with mean 0 and
variance $\sigma^2$, show that the moving average $A_t = \sum_{j=-q}^{q} a_j Z_{t-j}$ is "small"
for large $q$ in the sense that $EA_t = 0$ and $\mathrm{Var}(A_t) = \sigma^2/(2q+1)$.


1.12. a. Show that a linear filter $\{a_j\}$ passes an arbitrary polynomial of degree $k$
without distortion, i.e., that

$$m_t = \sum_j a_j m_{t-j}$$

for all $k$th-degree polynomials $m_t = c_0 + c_1 t + \cdots + c_k t^k$, if and only if

$$\sum_j a_j = 1 \quad \text{and} \quad \sum_j j^r a_j = 0, \text{ for } r = 1, \ldots, k.$$

b. Deduce that the Spencer 15-point moving-average filter $\{a_j\}$ defined by
(1.5.6) passes arbitrary third-degree polynomial trends without distortion.


1.13. Find a filter of the form $1 + \alpha B + \beta B^2 + \gamma B^3$ (i.e., find $\alpha$, $\beta$, and $\gamma$) that
passes linear trends without distortion and that eliminates arbitrary seasonal
components of period 2.


1.14. Show that the filter with coefficients $[a_{-2}, a_{-1}, a_0, a_1, a_2] = \frac{1}{9}[-1, 4, 3, 4, -1]$
passes third-degree polynomials and eliminates seasonal components with period 3.


1.15. Let $\{Y_t\}$ be a stationary process with mean zero and let $a$ and $b$ be constants.
a. If $X_t = a + bt + s_t + Y_t$, where $s_t$ is a seasonal component with period
12, show that $\nabla\nabla_{12} X_t = (1 - B)(1 - B^{12})X_t$ is stationary and express its
autocovariance function in terms of that of $\{Y_t\}$.
b. If $X_t = (a + bt)s_t + Y_t$, where $s_t$ is a seasonal component with period 12,
show that $\nabla_{12}^2 X_t = (1 - B^{12})^2 X_t$ is stationary and express its autocovariance
function in terms of that of $\{Y_t\}$.


1.16. (Using ITSM to smooth the strikes data.) Double-click on the ITSM icon, select
File>Project>Open>Univariate, click OK, and open the file STRIKES.TSM.
The graph of the data will then appear on your screen. To smooth the
data select Smooth>Moving Ave, Smooth>Exponential, or Smooth>FFT. Try
using each of these to reproduce the results shown in Figures 1.18, 1.21, and
1.22.


1.17. (Using ITSM to plot the deaths data.) In ITSM select File>Project>Open> 
Univariate, click OK, and open the project DEATHS.TSM. The graph of 
the data will then appear on your screen. To see a histogram of the data, click 
on the sixth yellow button at the top of the ITSM window. To see the sample 
autocorrelation function, click on the second yellow button. The presence of a 
strong seasonal component with period 12 is evident in the graph of the data 
and in the sample autocorrelation function. 


1.18. (Using ITSM to analyze the deaths data.) Open the file DEATHS.TSM, select
Transform>Classical, check the box marked Seasonal Fit, and enter 12
for the period. Make sure that the box labeled Polynomial Fit is not checked,
and click OK. You will then see the graph (Figure 1.24) of the deseasonalized
data. This graph suggests the presence of an additional quadratic trend function.
To fit such a trend to the deseasonalized data, select Transform>Undo Classical
to retrieve the original data. Then select Transform>Classical and
check the boxes marked Seasonal Fit and Polynomial Trend, entering 12
for the period and Quadratic for the trend. Click OK and you will obtain the
trend function

$$\hat{m}_t = 9952 - 71.82t + 0.8260t^2, \quad 1 \leq t \leq 72.$$

At this point the data stored in ITSM consists of the estimated noise

$$\hat{Y}_t = x_t - \hat{m}_t - \hat{s}_t, \quad t = 1, \ldots, 72,$$

obtained by subtracting the estimated seasonal and trend components from the
original data. The sample autocorrelation function can be plotted by clicking
on the second yellow button at the top of the ITSM window. Further tests for
dependence can be carried out by selecting the options Statistics>Residual
Analysis>Tests of Randomness. It is clear from these that there is substantial
dependence in the series $\{Y_t\}$.

To forecast the data without allowing for this dependence, select the option
Forecasting>ARMA. Specify 24 for the number of values to be forecast, and the program
will compute forecasts based on the assumption that the estimated seasonal and trend
components are true values and that $\{Y_t\}$ is a white noise sequence with zero mean.
(This is the default model assumed by ITSM until a more complicated stationary
model is estimated or specified.) The original data are plotted with the forecasts
appended.

Later we shall see how to improve on these forecasts by taking into account the
dependence in the series $\{Y_t\}$.


1.19. Use a text editor, e.g., WORDPAD or NOTEPAD, to construct and save a
text file named TEST.TSM, which consists of a single column of 30 numbers,
$\{x_1, \ldots, x_{30}\}$, defined by

$x_1, \ldots, x_{10}$: 486, 474, 434, 441, 435, 401, 414, 414, 386, 405;
$x_{11}, \ldots, x_{20}$: 411, 389, 414, 426, 410, 441, 459, 449, 486, 510;
$x_{21}, \ldots, x_{30}$: 506, 549, 579, 581, 630, 666, 674, 729, 771, 785.

This series is in fact the sum of a quadratic trend and a period-three seasonal
component. Use the program ITSM to apply the filter in Problem 1.14 to this
time series and discuss the results.
(Once the data have been typed, they can be imported directly into ITSM by
copying and pasting to the clipboard, and then in ITSM selecting File>Project>New>
Univariate, clicking on OK and selecting File>Import Clipboard.)


Stationary Processes 


2.1 Basic Properties
2.2 Linear Processes
2.3 Introduction to ARMA Processes
2.4 Properties of the Sample Mean and Autocorrelation Function
2.5 Forecasting Stationary Time Series
2.6 The Wold Decomposition


A key role in time series analysis is played by processes whose properties, or some 
of them, do not vary with time. If we wish to make predictions, then clearly we 
must assume that something does not vary with time. In extrapolating deterministic 
functions it is common practice to assume that either the function itself or one of its 
derivatives is constant. The assumption of a constant first derivative leads to linear 
extrapolation as a means of prediction. In time series analysis our goal is to predict 
a series that typically is not deterministic but contains a random component. If this 
random component is stationary, in the sense of Definition 1.4.2, then we can develop 
powerful techniques to forecast its future values. These techniques will be developed 
and discussed in this and subsequent chapters. 


2.1 Basic Properties 


In Section 1.4 we introduced the concept of stationarity and defined the autocovariance
function (ACVF) of a stationary time series $\{X_t\}$ as

$$\gamma(h) = \mathrm{Cov}(X_{t+h}, X_t), \quad h = 0, \pm 1, \pm 2, \ldots.$$

The autocorrelation function (ACF) of $\{X_t\}$ was defined similarly as the function $\rho(\cdot)$
whose value at lag $h$ is

$$\rho(h) = \frac{\gamma(h)}{\gamma(0)}.$$

The ACVF and ACF provide a useful measure of the degree of dependence among
the values of a time series at different times and for this reason play an important
role when we consider the prediction of future values of the series in terms of the
past and present values. They can be estimated from observations of $X_1, \ldots, X_n$ by
computing the sample ACVF and ACF as described in Section 1.4.1.

The role of the autocorrelation function in prediction is illustrated by the following
simple example. Suppose that $\{X_t\}$ is a stationary Gaussian time series (see
Definition A.3.2) and that we have observed $X_n$. We would like to find the function
of $X_n$ that gives us the best predictor of $X_{n+h}$, the value of the series after another $h$
time units have elapsed. To define the problem we must first say what we mean by
"best." A natural and computationally convenient definition is to specify our required
predictor to be the function of $X_n$ with minimum mean squared error. In this illustration,
and indeed throughout the remainder of this book, we shall use this as our
criterion for "best." Now by Proposition A.3.1 the conditional distribution of $X_{n+h}$
given that $X_n = x_n$ is

$$N\left(\mu + \rho(h)(x_n - \mu),\ \sigma^2(1 - \rho(h)^2)\right),$$

where $\mu$ and $\sigma^2$ are the mean and variance of $\{X_t\}$. It was shown in Problem 1.1 that
the value of the constant $c$ that minimizes $E(X_{n+h} - c)^2$ is $c = E(X_{n+h})$ and that the
function $m$ of $X_n$ that minimizes $E(X_{n+h} - m(X_n))^2$ is the conditional mean

$$m(X_n) = E(X_{n+h} \mid X_n) = \mu + \rho(h)(X_n - \mu). \tag{2.1.1}$$

The corresponding mean squared error is

$$E(X_{n+h} - m(X_n))^2 = \sigma^2(1 - \rho(h)^2). \tag{2.1.2}$$

This calculation shows that at least for stationary Gaussian time series, prediction of
$X_{n+h}$ in terms of $X_n$ is more accurate as $|\rho(h)|$ becomes closer to 1, and in the limit
as $\rho(h) \to \pm 1$ the best predictor approaches $\mu \pm (X_n - \mu)$ and the corresponding mean
squared error approaches 0.

In the preceding calculation the assumption of joint normality of $X_{n+h}$ and $X_n$
played a crucial role. For time series with nonnormal joint distributions the corresponding
calculations are in general much more complicated. However, if instead of
looking for the best function of $X_n$ for predicting $X_{n+h}$, we look for the best linear
predictor, i.e., the best predictor of the form $\ell(X_n) = aX_n + b$, then our problem
becomes that of finding $a$ and $b$ to minimize $E(X_{n+h} - aX_n - b)^2$. An elementary
calculation (Problem 2.1) shows that the best predictor of this form is

$$\ell(X_n) = \mu + \rho(h)(X_n - \mu) \tag{2.1.3}$$

with corresponding mean squared error

$$E(X_{n+h} - \ell(X_n))^2 = \sigma^2(1 - \rho(h)^2). \tag{2.1.4}$$

Comparison of (2.1.1) with (2.1.3) shows that for Gaussian processes, $\ell(X_n)$ and
$m(X_n)$ are the same. In general, of course, $m(X_n)$ will give smaller mean squared
error than $\ell(X_n)$, since it is the best of a larger class of predictors (see Problem 1.8).
However, the fact that the best linear predictor depends only on the mean and ACF of
the series $\{X_t\}$ means that it can be calculated without more detailed knowledge of the
joint distributions. This is extremely important in practice because of the difficulty
of estimating all of the joint distributions and because of the difficulty of computing
the required conditional expectations even if the distributions were known.
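
As a small numerical illustration of (2.1.3) and (2.1.4), the sketch below evaluates the best linear predictor and its mean squared error for a few lags. The function names are hypothetical, and the autocorrelation $\rho(h) = \phi^{|h|}$ with $\phi = 0.9$ (the form taken by a causal AR(1) process) is an assumption made only to have concrete numbers.

```python
def best_linear_predictor(x_n, mu, rho_h):
    """Best linear predictor of X_{n+h} from X_n: mu + rho(h) (x_n - mu)."""
    return mu + rho_h * (x_n - mu)


def prediction_mse(sigma2, rho_h):
    """Mean squared error of that predictor: sigma^2 (1 - rho(h)^2)."""
    return sigma2 * (1 - rho_h ** 2)


# Assumed values for illustration: mean 0, variance 1, rho(h) = phi**h with phi = 0.9.
mu, sigma2, phi, x_n = 0.0, 1.0, 0.9, 2.0
for h in (1, 5, 20):
    rho_h = phi ** h
    print(h, round(best_linear_predictor(x_n, mu, rho_h), 3), round(prediction_mse(sigma2, rho_h), 3))
```

As the lag grows, $\rho(h)$ shrinks toward 0, the predictor falls back to the mean $\mu$, and the mean squared error rises toward the variance $\sigma^2$.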

As we shall see later in this chapter, similar conclusions apply when we consider
the more general problem of predicting $X_{n+h}$ as a function not only of $X_n$, but also of
$X_{n-1}, X_{n-2}, \ldots$. Before pursuing this question we need to examine in more detail the
properties of the autocovariance and autocorrelation functions of a stationary time
series.


Basic Properties of $\gamma(\cdot)$:

$$\gamma(0) \geq 0,$$

$$|\gamma(h)| \leq \gamma(0) \text{ for all } h,$$

and $\gamma(\cdot)$ is even, i.e.,

$$\gamma(h) = \gamma(-h) \text{ for all } h.$$

The first property is simply the statement that $\mathrm{Var}(X_t) \geq 0$, the second is an immediate
consequence of the fact that correlations are less than or equal to 1 in absolute value
(or the Cauchy-Schwarz inequality), and the third is established by observing that

$$\gamma(h) = \mathrm{Cov}(X_{t+h}, X_t) = \mathrm{Cov}(X_t, X_{t+h}) = \gamma(-h).$$


Autocovariance functions have another fundamental property, namely that of 
nonnegative definiteness. 


Definition 2.1.1 A real-valued function $\kappa$ defined on the integers is nonnegative definite if

$$\sum_{i,j=1}^{n} a_i \kappa(i - j) a_j \geq 0 \tag{2.1.5}$$

for all positive integers $n$ and vectors $\mathbf{a} = (a_1, \ldots, a_n)'$ with real-valued components $a_i$.


Theorem 2.1.1 A real-valued function defined on the integers is the autocovariance function of a
stationary time series if and only if it is even and nonnegative definite.


Proof. To show that the autocovariance function $\gamma(\cdot)$ of any stationary time series $\{X_t\}$ is
nonnegative definite, let $\mathbf{a}$ be any $n \times 1$ vector with real components $a_1, \ldots, a_n$ and
let $\mathbf{X}_n = (X_1, \ldots, X_n)'$. Then by equation (A.2.5) and the nonnegativity of variances,

$$\mathrm{Var}(\mathbf{a}'\mathbf{X}_n) = \mathbf{a}'\Gamma_n\mathbf{a} = \sum_{i,j=1}^{n} a_i \gamma(i - j) a_j \geq 0,$$

where $\Gamma_n$ is the covariance matrix of the random vector $\mathbf{X}_n$. The last inequality,
however, is precisely the statement that $\gamma(\cdot)$ is nonnegative definite. The converse
result, that there exists a stationary time series with autocovariance function $\kappa$ if $\kappa$ is
even, real-valued, and nonnegative definite, is more difficult to establish (see TSTM,
Theorem 1.5.1 for a proof). A slightly stronger statement can be made, namely, that
under the specified conditions there exists a stationary Gaussian time series $\{X_t\}$ with
mean 0 and autocovariance function $\kappa(\cdot)$.


Remark 1. An autocorrelation function $\rho(\cdot)$ has all the properties of an autocovariance
function and satisfies the additional condition $\rho(0) = 1$. In particular, we can
say that $\rho(\cdot)$ is the autocorrelation function of a stationary process if and only if $\rho(\cdot)$
is an ACVF with $\rho(0) = 1$.


Remark 2. To verify that a given function is nonnegative definite it is often simpler
to find a stationary process that has the given function as its ACVF than to verify the
conditions (2.1.5) directly. For example, the function $\kappa(h) = \cos(\omega h)$ is nonnegative
definite, since (see Problem 2.2) it is the ACVF of the stationary process

$$X_t = A\cos(\omega t) + B\sin(\omega t),$$

where $A$ and $B$ are uncorrelated random variables, both with mean 0 and variance 1.
Another illustration is provided by the following example.


Example 2.1.1 We shall show now that the function defined on the integers by

$$\kappa(h) = \begin{cases} 1, & \text{if } h = 0, \\ \rho, & \text{if } h = \pm 1, \\ 0, & \text{otherwise}, \end{cases}$$

is the ACVF of a stationary time series if and only if $|\rho| \leq \frac{1}{2}$. Inspection of the ACVF
of the MA(1) process of Example 1.4.4 shows that $\kappa$ is the ACVF of such a process
if we can find real $\theta$ and nonnegative $\sigma^2$ such that

$$\sigma^2(1 + \theta^2) = 1$$

and

$$\sigma^2\theta = \rho.$$

If $|\rho| \leq \frac{1}{2}$, these equations give solutions $\theta = (2\rho)^{-1}\left(1 \pm \sqrt{1 - 4\rho^2}\right)$ and
$\sigma^2 = (1 + \theta^2)^{-1}$. However, if $|\rho| > \frac{1}{2}$, there is no real solution for $\theta$ and hence no MA(1)
process with ACVF $\kappa$. To show that there is no stationary process with ACVF $\kappa$,
we need to show that $\kappa$ is not nonnegative definite. We shall do this directly from
the definition (2.1.5). First, if $\rho > \frac{1}{2}$, $K = [\kappa(i - j)]_{i,j=1}^{n}$, and $\mathbf{a}$ is the $n$-component
vector $\mathbf{a} = (1, -1, 1, -1, \ldots)'$, then

$$\mathbf{a}'K\mathbf{a} = n - 2(n-1)\rho < 0 \text{ for } n > 2\rho/(2\rho - 1),$$

showing that $\kappa(\cdot)$ is not nonnegative definite and therefore, by Theorem 2.1.1, is not
an autocovariance function. If $\rho < -\frac{1}{2}$, the same argument with $\mathbf{a} = (1, 1, 1, 1, \ldots)'$
again shows that $\kappa(\cdot)$ is not nonnegative definite.
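
Both halves of the argument in this example can be checked numerically. The sketch below (hypothetical function names; the values $\rho = 0.6$ and $\rho = 0.4$ are chosen only for illustration) evaluates the quadratic form in (2.1.5) for the alternating vector and solves the two MA(1) equations using the root with $|\theta| \leq 1$.

```python
import math


def kappa(h, rho):
    """The candidate ACVF: 1 at lag 0, rho at lags +-1, 0 otherwise."""
    if h == 0:
        return 1.0
    return rho if abs(h) == 1 else 0.0


def quadratic_form(a, rho):
    """Evaluate sum_{i,j} a_i kappa(i - j) a_j."""
    n = len(a)
    return sum(a[i] * kappa(i - j, rho) * a[j] for i in range(n) for j in range(n))


def ma1_parameters(rho):
    """Solve sigma^2 (1 + theta^2) = 1 and sigma^2 theta = rho for 0 < |rho| <= 1/2."""
    theta = (1 - math.sqrt(1 - 4 * rho * rho)) / (2 * rho)
    sigma2 = 1 / (1 + theta * theta)
    return theta, sigma2


# rho = 0.6 > 1/2: the bound 2 rho / (2 rho - 1) is about 6, so n = 7 gives a negative value.
alternating = [(-1.0) ** i for i in range(7)]
print(quadratic_form(alternating, 0.6))

# rho = 0.4 <= 1/2: an MA(1) process with this ACVF exists.
theta, sigma2 = ma1_parameters(0.4)
print(theta, sigma2, sigma2 * theta)
```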

If $\{X_t\}$ is a (weakly) stationary time series, then the vector $(X_1, \ldots, X_n)'$ and the
time-shifted vector $(X_{1+h}, \ldots, X_{n+h})'$ have the same mean vectors and covariance
matrices for every integer $h$ and positive integer $n$. A strictly stationary sequence is
one in which the joint distributions of these two vectors (and not just the means and
covariances) are the same. The precise definition is given below.


Definition 2.1.2 $\{X_t\}$ is a strictly stationary time series if

$$(X_1, \ldots, X_n)' \overset{d}{=} (X_{1+h}, \ldots, X_{n+h})'$$

for all integers $h$ and $n \geq 1$. (Here $\overset{d}{=}$ is used to indicate that the two random
vectors have the same joint distribution function.)


For reference, we record some of the elementary properties of strictly stationary 
time series. 


Properties of a Strictly Stationary Time Series $\{X_t\}$:
a. The random variables $X_t$ are identically distributed.
b. $(X_t, X_{t+h})' \overset{d}{=} (X_1, X_{1+h})'$ for all integers $t$ and $h$.
c. $\{X_t\}$ is weakly stationary if $E(X_t^2) < \infty$ for all $t$.
d. Weak stationarity does not imply strict stationarity.
e. An iid sequence is strictly stationary.


Properties (a) and (b) follow at once from Definition 2.1.2. If $EX_t^2 < \infty$, then by (a)
and (b) $EX_t$ is independent of $t$ and $\mathrm{Cov}(X_t, X_{t+h}) = \mathrm{Cov}(X_1, X_{1+h})$, which is also
independent of $t$, proving (c). For (d) see Problem 1.8. If $\{X_t\}$ is an iid sequence of
random variables with common distribution function $F$, then the joint distribution
function of $(X_{1+h}, \ldots, X_{n+h})'$ evaluated at $(x_1, \ldots, x_n)'$ is $F(x_1) \cdots F(x_n)$, which is
independent of $h$.


One of the simplest ways to construct a time series $\{X_t\}$ that is strictly stationary
(and hence stationary if $EX_t^2 < \infty$) is to "filter" an iid sequence of random variables.
Let $\{Z_t\}$ be an iid sequence, which by (e) is strictly stationary, and define

$$X_t = g(Z_t, Z_{t-1}, \ldots, Z_{t-q}) \tag{2.1.6}$$

for some real-valued function $g(\cdot, \ldots, \cdot)$. Then $\{X_t\}$ is strictly stationary, since
$(Z_{t+h}, \ldots, Z_{t+h-q})' \overset{d}{=} (Z_t, \ldots, Z_{t-q})'$ for all integers $h$. It follows also from the
defining equation (2.1.6) that $\{X_t\}$ is $q$-dependent, i.e., that $X_s$ and $X_t$ are independent
whenever $|t - s| > q$. (An iid sequence is 0-dependent.) In the same way,
adopting a second-order viewpoint, we say that a stationary time series is $q$-correlated
if $\gamma(h) = 0$ whenever $|h| > q$. A white noise sequence is then 0-correlated, while
the MA(1) process of Example 1.4.4 is 1-correlated. The moving-average process of
order $q$ defined below is $q$-correlated, and perhaps surprisingly, the converse is also
true (Proposition 2.1.1).


The MA($q$) Process:

$\{X_t\}$ is a moving-average process of order $q$ if

$$X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \tag{2.1.7}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\theta_1, \ldots, \theta_q$ are constants.


It is a simple matter to check that (2.1.7) defines a stationary time series that is strictly
stationary if $\{Z_t\}$ is iid noise. In the latter case, (2.1.7) is a special case of (2.1.6) with
$g$ a linear function.
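
The $q$-correlated behavior of an MA($q$) process can be seen in a simulation. The following sketch filters Gaussian iid noise as in (2.1.7) with $q = 2$; the coefficients $\theta_1 = 0.6$, $\theta_2 = 0.3$, the sample size, and the function names are arbitrary choices made for illustration.

```python
import random


def simulate_ma(thetas, n, sigma=1.0, seed=0):
    """Simulate X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q} with Gaussian iid Z_t."""
    rng = random.Random(seed)
    q = len(thetas)
    z = [rng.gauss(0.0, sigma) for _ in range(n + q)]
    coeffs = [1.0] + list(thetas)
    return [sum(c * z[t - j] for j, c in enumerate(coeffs)) for t in range(q, n + q)]


def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag (divisor-n autocovariances)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    gamma0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t + h] * dev[t] for t in range(n - h)) / n / gamma0
        for h in range(max_lag + 1)
    ]


x = simulate_ma([0.6, 0.3], n=5000)
acf = sample_acf(x, 5)
print([round(r, 3) for r in acf])
```

For these coefficients the theoretical autocorrelations are $\rho(1) = 0.78/1.45$ and $\rho(2) = 0.3/1.45$, with $\rho(h) = 0$ for $|h| > 2$, so the sample values at lags 3 to 5 should be close to zero.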

The importance of MA($q$) processes derives from the fact that every $q$-correlated
process is an MA($q$) process. This is the content of the following proposition, whose
proof can be found in TSTM, Section 3.2. The extension of this result to the case
$q = \infty$ is essentially Wold's decomposition (see Section 2.6).


Proposition 2.1.1 If $\{X_t\}$ is a stationary $q$-correlated time series with mean 0, then it can be represented
as the MA($q$) process in (2.1.7).


2.2 Linear Processes


The class of linear time series models, which includes the class of autoregressive 
moving-average (ARMA) models, provides a general framework for studying sta- 
tionary processes. In fact, every second-order stationary process is either a linear 
process or can be transformed to a linear process by subtracting a deterministic com- 
ponent. This result is known as Wold’s decomposition and is discussed in Section 2.6. 


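Definition 2.2.1 The time series $\{X_t\}$ is a linear process if it has the representation

$$X_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j} \tag{2.2.1}$$

for all $t$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\{\psi_j\}$ is a sequence of constants with
$\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$.


In terms of the backward shift operator $B$, (2.2.1) can be written more compactly as

$$X_t = \psi(B)Z_t, \tag{2.2.2}$$

where $\psi(B) = \sum_{j=-\infty}^{\infty} \psi_j B^j$. A linear process is called a moving average or MA($\infty$)
if $\psi_j = 0$ for all $j < 0$, i.e., if

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}.$$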


Remark 1. The condition $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ ensures that the infinite sum in (2.2.1)
converges (with probability one), since $E|Z_t| \le \sigma$ and

$$E|X_t| \le \sum_{j=-\infty}^{\infty} \left(|\psi_j|\, E|Z_{t-j}|\right) \le \left(\sum_{j=-\infty}^{\infty} |\psi_j|\right) \sigma < \infty.$$

It also ensures that $\sum_{j=-\infty}^{\infty} \psi_j^2 < \infty$ and hence (see Appendix C, Example C.1.1) that
the series in (2.2.1) converges in mean square, i.e., that $X_t$ is the mean square limit
of the partial sums $\sum_{j=-n}^{n} \psi_j Z_{t-j}$. The condition $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ also ensures
convergence in both senses of the more general series (2.2.3) considered in Proposition
2.2.1 below. In Section 10.5 we consider a more general class of linear processes, the
fractionally integrated ARMA processes, for which the coefficients are not absolutely
summable but only square summable.


The operator $\psi(B)$ can be thought of as a linear filter, which when applied to
the white noise "input" series $\{Z_t\}$ produces the "output" $\{X_t\}$ (see Section 4.3). As
established in the following proposition, a linear filter, when applied to any stationary
input series, produces a stationary output series.


Proposition 2.2.1 Let $\{Y_t\}$ be a stationary time series with mean 0 and covariance function $\gamma_Y$. If
$\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$, then the time series

$$X_t = \sum_{j=-\infty}^{\infty} \psi_j Y_{t-j} = \psi(B) Y_t \tag{2.2.3}$$

is stationary with mean 0 and autocovariance function

$$\gamma_X(h) = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \psi_j \psi_k \gamma_Y(h + k - j). \tag{2.2.4}$$

In the special case where $\{X_t\}$ is a linear process,

$$\gamma_X(h) = \sum_{j=-\infty}^{\infty} \psi_j \psi_{j+h} \sigma^2. \tag{2.2.5}$$


Proof The argument used in Remark 1, with $\sigma$ replaced by $\sqrt{\gamma_Y(0)}$, shows that the series in
(2.2.3) is convergent. Since $EY_t = 0$, we have

$$E(X_t) = E\left(\sum_{j=-\infty}^{\infty} \psi_j Y_{t-j}\right) = \sum_{j=-\infty}^{\infty} \psi_j E(Y_{t-j}) = 0$$

and

$$\begin{aligned}
E(X_{t+h} X_t) &= E\left[\left(\sum_{j=-\infty}^{\infty} \psi_j Y_{t+h-j}\right)\left(\sum_{k=-\infty}^{\infty} \psi_k Y_{t-k}\right)\right] \\
&= \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \psi_j \psi_k E(Y_{t+h-j} Y_{t-k}) \\
&= \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \psi_j \psi_k \gamma_Y(h - j + k),
\end{aligned}$$

which shows that $\{X_t\}$ is stationary with covariance function (2.2.4). (The interchange
of summation and expectation operations in the above calculations can be justified
by the absolute summability of $\psi_j$.) Finally, if $\{Y_t\}$ is the white noise sequence $\{Z_t\}$
in (2.2.1), then $\gamma_Y(h - j + k) = \sigma^2$ if $k = j - h$ and 0 otherwise, from which (2.2.5)
follows. ■


Remark 2. The absolute convergence of (2.2.3) implies (Problem 2.6) that filters of
the form $\alpha(B) = \sum_{j=-\infty}^{\infty} \alpha_j B^j$ and $\beta(B) = \sum_{j=-\infty}^{\infty} \beta_j B^j$ with absolutely summable
coefficients can be applied successively to a stationary series $\{Y_t\}$ to generate a new
stationary series

$$W_t = \sum_{j=-\infty}^{\infty} \psi_j Y_{t-j},$$

where

$$\psi_j = \sum_{k=-\infty}^{\infty} \alpha_k \beta_{j-k} = \sum_{k=-\infty}^{\infty} \beta_k \alpha_{j-k}. \tag{2.2.6}$$

These relations can be expressed in the equivalent form

$$W_t = \psi(B) Y_t,$$

where

$$\psi(B) = \alpha(B)\beta(B) = \beta(B)\alpha(B), \tag{2.2.7}$$

and the products are defined by (2.2.6) or equivalently by multiplying the series
$\sum_{j=-\infty}^{\infty} \alpha_j B^j$ and $\sum_{j=-\infty}^{\infty} \beta_j B^j$ term by term and collecting powers of B. It is clear
from (2.2.6) and (2.2.7) that the order of application of the filters $\alpha(B)$ and $\beta(B)$ is
immaterial.


Example 2.2.1 An AR(1) process


In Example 1.4.5, an AR(1) process was defined as a stationary solution $\{X_t\}$ of the
equations

$$X_t - \phi X_{t-1} = Z_t, \tag{2.2.8}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, $|\phi| < 1$, and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. To
show that such a solution exists and is the unique stationary solution of (2.2.8), we
consider the linear process defined by

$$X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}. \tag{2.2.9}$$

(The coefficients $\phi^j$ for $j \ge 0$ are absolutely summable, since $|\phi| < 1$.) It is easy
to verify directly that the process (2.2.9) is a solution of (2.2.8), and by Proposition
2.2.1 it is also stationary with mean 0 and ACVF

$$\gamma_X(h) = \sum_{j=0}^{\infty} \phi^j \phi^{j+h} \sigma^2 = \frac{\sigma^2 \phi^h}{1 - \phi^2}$$

for $h \ge 0$.
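As a quick numerical sanity check of the AR(1) ACVF, the following sketch compares a truncated version of the infinite sum with the closed form $\sigma^2\phi^h/(1-\phi^2)$. The function names, the truncation point, and the parameter values are illustrative assumptions.

```python
def ar1_acvf_truncated(phi, sigma2, h, terms=200):
    """Truncated version of sum_{j>=0} phi^j * phi^(j+h) * sigma^2."""
    return sigma2 * sum(phi ** j * phi ** (j + h) for j in range(terms))

def ar1_acvf_closed(phi, sigma2, h):
    """Closed form sigma^2 * phi^h / (1 - phi^2), valid for |phi| < 1 and h >= 0."""
    return sigma2 * phi ** h / (1.0 - phi ** 2)

# Hypothetical parameter values: phi = 0.6, sigma^2 = 2
diffs = [abs(ar1_acvf_truncated(0.6, 2.0, h) - ar1_acvf_closed(0.6, 2.0, h))
         for h in range(5)]
```

With $|\phi|$ well below 1 the geometric tail is negligible after a few hundred terms, so the two computations should agree to floating-point accuracy.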
To show that (2.2.9) is the only stationary solution of (2.2.8) let $\{Y_t\}$ be any
stationary solution. Then, iterating (2.2.8), we obtain

$$\begin{aligned}
Y_t &= \phi Y_{t-1} + Z_t \\
&= Z_t + \phi Z_{t-1} + \phi^2 Y_{t-2} \\
&= \cdots \\
&= Z_t + \phi Z_{t-1} + \cdots + \phi^k Z_{t-k} + \phi^{k+1} Y_{t-k-1}.
\end{aligned}$$

If $\{Y_t\}$ is stationary, then $EY_t^2$ is finite and independent of t, so that

$$E\left(Y_t - \sum_{j=0}^{k} \phi^j Z_{t-j}\right)^2 = \phi^{2k+2} E\left(Y_{t-k-1}\right)^2 \to 0 \quad \text{as } k \to \infty.$$

This implies that $Y_t$ is equal to the mean square limit $\sum_{j=0}^{\infty} \phi^j Z_{t-j}$ and hence that the
process defined by (2.2.9) is the unique stationary solution of the equations (2.2.8).

In the case $|\phi| > 1$, the series in (2.2.9) does not converge. However, we can
rewrite (2.2.8) in the form

$$X_t = -\phi^{-1} Z_{t+1} + \phi^{-1} X_{t+1}. \tag{2.2.10}$$

Iterating (2.2.10) gives

$$\begin{aligned}
X_t &= -\phi^{-1} Z_{t+1} - \phi^{-2} Z_{t+2} + \phi^{-2} X_{t+2} \\
&= \cdots \\
&= -\phi^{-1} Z_{t+1} - \cdots - \phi^{-k-1} Z_{t+k+1} + \phi^{-k-1} X_{t+k+1},
\end{aligned}$$

which shows, by the same arguments used above, that

$$X_t = -\sum_{j=1}^{\infty} \phi^{-j} Z_{t+j} \tag{2.2.11}$$

is the unique stationary solution of (2.2.8). This solution should not be confused with
the nonstationary solution $\{X_t\}$ of (2.2.8) obtained when $X_0$ is any specified random
variable that is uncorrelated with $\{Z_t\}$.

The solution (2.2.11) is frequently regarded as unnatural, since $X_t$ as defined by
(2.2.11) is correlated with future values of $Z_s$, contrasting with the solution (2.2.9),
which has the property that $X_t$ is uncorrelated with $Z_s$ for all $s > t$. It is customary
therefore in modeling stationary time series to restrict attention to AR(1) processes
with $|\phi| < 1$. Then $X_t$ has the representation (2.2.9) in terms of $\{Z_s, s \le t\}$, and we
say that $\{X_t\}$ is a causal or future-independent function of $\{Z_t\}$, or more concisely
that $\{X_t\}$ is a causal autoregressive process. It should be noted that every AR(1)
process with $|\phi| > 1$ can be reexpressed as an AR(1) process with $|\phi| < 1$ and a new
white noise sequence (Problem 3.8). From a second-order point of view, therefore,
nothing is lost by eliminating AR(1) processes with $|\phi| > 1$ from consideration.

If $\phi = \pm 1$, there is no stationary solution of (2.2.8) (see Problem 2.8).


Remark 3. It is worth remarking that when $|\phi| < 1$ the unique stationary solution
(2.2.9) can be found immediately with the aid of (2.2.7). To do this let $\phi(B) = 1 - \phi B$
and $\pi(B) = \sum_{j=0}^{\infty} \phi^j B^j$. Then

$$\psi(B) := \phi(B)\pi(B) = 1.$$

Applying the operator $\pi(B)$ to both sides of (2.2.8), we obtain

$$X_t = \pi(B) Z_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}$$

as claimed.


2.3 Introduction to ARMA Processes


In this section we introduce, through an example, some of the key properties of an
important class of linear processes known as ARMA (autoregressive moving average)
processes. These are defined by linear difference equations with constant coefficients.
As our example we shall consider the ARMA(1,1) process. Higher-order ARMA
processes will be discussed in Chapter 3.


Definition 2.3.1 The time series $\{X_t\}$ is an ARMA(1,1) process if it is stationary and satisfies (for
every t)

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \tag{2.3.1}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\phi + \theta \ne 0$.


Using the backward shift operator B, (2.3.1) can be written more concisely as

$$\phi(B) X_t = \theta(B) Z_t, \tag{2.3.2}$$

where $\phi(B)$ and $\theta(B)$ are the linear filters

$$\phi(B) = 1 - \phi B \quad \text{and} \quad \theta(B) = 1 + \theta B,$$

respectively.

We first investigate the range of values of $\phi$ and $\theta$ for which a stationary solution
of (2.3.1) exists. If $|\phi| < 1$, let $\chi(z)$ denote the power series expansion of $1/\phi(z)$,
i.e., $\sum_{j=0}^{\infty} \phi^j z^j$, which has absolutely summable coefficients. Then from (2.2.7) we
conclude that $\chi(B)\phi(B) = 1$. Applying $\chi(B)$ to each side of (2.3.2) therefore gives

$$X_t = \chi(B)\theta(B) Z_t = \psi(B) Z_t,$$

where

$$\psi(B) = \sum_{j=0}^{\infty} \psi_j B^j = \left(1 + \phi B + \phi^2 B^2 + \cdots\right)(1 + \theta B).$$

By multiplying out the right-hand side or using (2.2.6), we find that

$$\psi_0 = 1 \quad \text{and} \quad \psi_j = (\phi + \theta)\phi^{j-1} \text{ for } j \ge 1.$$

As in Example 2.2.1, we conclude that the MA($\infty$) process

$$X_t = Z_t + (\phi + \theta) \sum_{j=1}^{\infty} \phi^{j-1} Z_{t-j} \tag{2.3.3}$$

is the unique stationary solution of (2.3.1).
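The coefficients $\psi_j$ of the causal ARMA(1,1) solution can be reproduced by multiplying the two power series numerically, as in (2.2.6). The sketch below is an illustration only: the helper `convolve`, the truncation length `m`, and the parameter values are assumptions made for the example.

```python
def convolve(a, b):
    """Coefficients of the product of two polynomials, as in (2.2.6)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

phi, theta, m = 0.5, 0.4, 10                 # hypothetical values with |phi| < 1
chi = [phi ** j for j in range(m)]           # first m terms of 1/(1 - phi*z)
psi = convolve(chi, [1.0, theta])[:m]        # first m coefficients of chi(z)*(1 + theta*z)
closed = [1.0] + [(phi + theta) * phi ** (j - 1) for j in range(1, m)]
```

Only the first `m` coefficients of the product are kept, because the later ones are affected by the truncation of $\chi(z)$.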
Now suppose that $|\phi| > 1$. We first represent $1/\phi(z)$ as a series of powers of z with
absolutely summable coefficients by expanding in powers of $z^{-1}$, giving (Problem
2.7)

$$\frac{1}{\phi(z)} = -\sum_{j=1}^{\infty} \phi^{-j} z^{-j}.$$

Then we can apply the same argument as in the case where $|\phi| < 1$ to obtain the
unique stationary solution of (2.3.1). We let $\chi(B) = -\sum_{j=1}^{\infty} \phi^{-j} B^{-j}$ and apply $\chi(B)$
to each side of (2.3.2) to obtain

$$X_t = \chi(B)\theta(B) Z_t = -\theta\phi^{-1} Z_t - (\theta + \phi) \sum_{j=1}^{\infty} \phi^{-j-1} Z_{t+j}. \tag{2.3.4}$$

If $\phi = \pm 1$, there is no stationary solution of (2.3.1). Consequently, there is no
such thing as an ARMA(1,1) process with $\phi = \pm 1$ according to our definition.
We can now summarize our findings about the existence and nature of the
stationary solutions of the ARMA(1,1) recursions (2.3.2) as follows:


• A stationary solution of the ARMA(1,1) equations exists if and only if $\phi \ne \pm 1$.


• If $|\phi| < 1$, then the unique stationary solution is given by (2.3.3). In this case we
say that $\{X_t\}$ is causal or a causal function of $\{Z_t\}$, since $X_t$ can be expressed in
terms of the current and past values $Z_s$, $s \le t$.


• If $|\phi| > 1$, then the unique stationary solution is given by (2.3.4). The solution is
noncausal, since $X_t$ is then a function of $Z_s$, $s \ge t$.


Just as causality means that $X_t$ is expressible in terms of $Z_s$, $s \le t$, the dual
concept of invertibility means that $Z_t$ is expressible in terms of $X_s$, $s \le t$. We show now
that the ARMA(1,1) process defined by (2.3.1) is invertible if $|\theta| < 1$. To demonstrate
this, let $\xi(z)$ denote the power series expansion of $1/\theta(z)$, i.e., $\sum_{j=0}^{\infty} (-\theta)^j z^j$,
which has absolutely summable coefficients. From (2.2.7) it therefore follows that
$\xi(B)\theta(B) = 1$, and applying $\xi(B)$ to each side of (2.3.2) gives

$$Z_t = \xi(B)\phi(B) X_t = \pi(B) X_t,$$

where

$$\pi(B) = \sum_{j=0}^{\infty} \pi_j B^j = \left(1 - \theta B + (-\theta)^2 B^2 + \cdots\right)(1 - \phi B).$$

By multiplying out the right-hand side or using (2.2.6), we find that

$$Z_t = X_t - (\phi + \theta) \sum_{j=1}^{\infty} (-\theta)^{j-1} X_{t-j}. \tag{2.3.5}$$

Thus the ARMA(1,1) process is invertible, since $Z_t$ can be expressed in terms of the
present and past values of the process $X_s$, $s \le t$. An argument like the one used to
show noncausality when $|\phi| > 1$ shows that the ARMA(1,1) process is noninvertible
when $|\theta| > 1$, since then

$$Z_t = -\phi\theta^{-1} X_t + (\theta + \phi) \sum_{j=1}^{\infty} (-\theta)^{-j-1} X_{t+j}. \tag{2.3.6}$$


We summarize these results as follows:


• If $|\theta| < 1$, then the ARMA(1,1) process is invertible, and $Z_t$ is expressed in terms
of $X_s$, $s \le t$, by (2.3.5).


• If $|\theta| > 1$, then the ARMA(1,1) process is noninvertible, and $Z_t$ is expressed in
terms of $X_s$, $s \ge t$, by (2.3.6).
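The inversion formula (2.3.5) can be illustrated on a short generated series. The sketch below makes one simplifying assumption that is not in the text: the recursion is started with $X_t = Z_t = 0$ for $t < 0$, so the generated series is not exactly stationary, but the sum in (2.3.5) then terminates at the start of the data and recovers the noise exactly. All names and values are illustrative.

```python
def simulate_arma11(phi, theta, z):
    """X_t = phi*X_{t-1} + Z_t + theta*Z_{t-1}, assuming X_t = Z_t = 0 for t < 0."""
    x = []
    for t, zt in enumerate(z):
        x_prev = x[t - 1] if t > 0 else 0.0
        z_prev = z[t - 1] if t > 0 else 0.0
        x.append(phi * x_prev + zt + theta * z_prev)
    return x

def recover_noise(phi, theta, x):
    """Z_t = X_t - (phi + theta) * sum_{j=1}^{t} (-theta)^(j-1) * X_{t-j}, cf. (2.3.5)."""
    z = []
    for t in range(len(x)):
        tail = sum((-theta) ** (j - 1) * x[t - j] for j in range(1, t + 1))
        z.append(x[t] - (phi + theta) * tail)
    return z

z = [0.3, -1.2, 0.7, 0.1, -0.5, 2.0]        # arbitrary "noise" values for the demo
x = simulate_arma11(0.5, 0.4, z)
z_hat = recover_noise(0.5, 0.4, x)
```

With $|\theta| < 1$ the weights $(-\theta)^{j-1}$ decay geometrically, which is why the present and recent past of the series dominate the reconstruction.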


Remark 1. In the cases $\theta = \pm 1$, the ARMA(1,1) process is invertible in the more
general sense that $Z_t$ is a mean square limit of finite linear combinations of $X_s$, $s \le t$,
although it cannot be expressed explicitly as an infinite linear combination of $X_s$, $s \le t$
(see Section 4.4 of TSTM). In this book the term invertible will always be used in
the more restricted sense that $Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}$, where $\sum_{j=0}^{\infty} |\pi_j| < \infty$.


Remark 2. If the ARMA(1,1) process $\{X_t\}$ is noncausal or noninvertible with
$|\theta| > 1$, then it is possible to find a new white noise sequence $\{W_t\}$ such that $\{X_t\}$
is a causal and noninvertible ARMA(1,1) process relative to $\{W_t\}$ (Problem 4.10).
Therefore, from a second-order point of view, nothing is lost by restricting attention to
causal and invertible ARMA(1,1) models. This last sentence is also valid for
higher-order ARMA models.


2.4 Properties of the Sample Mean and Autocorrelation Function 


A stationary process $\{X_t\}$ is characterized, at least from a second-order point of view,
by its mean $\mu$ and its autocovariance function $\gamma(\cdot)$. The estimation of $\mu$, $\gamma(\cdot)$, and the
autocorrelation function $\rho(\cdot) = \gamma(\cdot)/\gamma(0)$ from observations $X_1, \ldots, X_n$ therefore
plays a crucial role in problems of inference and in particular in the problem of
constructing an appropriate model for the data. In this section we examine some of
the properties of the sample estimates $\bar{x}$ and $\hat\rho(\cdot)$ of $\mu$ and $\rho(\cdot)$, respectively.


2.4.1 Estimation of $\mu$


The moment estimator of the mean $\mu$ of a stationary process is the sample mean

$$\bar{X}_n = n^{-1}(X_1 + X_2 + \cdots + X_n). \tag{2.4.1}$$

It is an unbiased estimator of $\mu$, since

$$E(\bar{X}_n) = n^{-1}(EX_1 + \cdots + EX_n) = \mu.$$

The mean squared error of $\bar{X}_n$ is

$$\begin{aligned}
E(\bar{X}_n - \mu)^2 = \mathrm{Var}(\bar{X}_n) &= n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}(X_i, X_j) \\
&= n^{-2} \sum_{i-j=-n}^{n} (n - |i - j|)\,\gamma(i - j) \\
&= n^{-1} \sum_{h=-n}^{n} \left(1 - \frac{|h|}{n}\right) \gamma(h).
\end{aligned} \tag{2.4.2}$$

Now if $\gamma(h) \to 0$ as $h \to \infty$, the right-hand side of (2.4.2) converges to zero,
so that $\bar{X}_n$ converges in mean square to $\mu$. If $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$, then (2.4.2)
gives $\lim_{n \to \infty} n\,\mathrm{Var}(\bar{X}_n) = \sum_{|h|<\infty} \gamma(h)$. We record these results in the following
proposition.


Proposition 2.4.1 If $\{X_t\}$ is a stationary time series with mean $\mu$ and autocovariance function $\gamma(\cdot)$,
then as $n \to \infty$,

$$\mathrm{Var}(\bar{X}_n) = E(\bar{X}_n - \mu)^2 \to 0 \quad \text{if } \gamma(n) \to 0,$$

$$nE(\bar{X}_n - \mu)^2 \to \sum_{|h|<\infty} \gamma(h) \quad \text{if } \sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty.$$
To make inferences about $\mu$ using the sample mean $\bar{X}_n$, it is necessary to know
the distribution or an approximation to the distribution of $\bar{X}_n$. If the time series is
Gaussian (see Definition A.3.2), then by Remark 2 of Section A.3 and (2.4.2),

$$n^{1/2}(\bar{X}_n - \mu) \sim N\left(0, \sum_{|h|<n} \left(1 - \frac{|h|}{n}\right) \gamma(h)\right).$$

It is easy to construct exact confidence bounds for $\mu$ using this result if $\gamma(\cdot)$ is
known, and approximate confidence bounds if it is necessary to estimate $\gamma(\cdot)$ from
the observations.

For many time series, in particular for linear and ARMA models, $\bar{X}_n$ is approximately
normal with mean $\mu$ and variance $n^{-1} \sum_{|h|<\infty} \gamma(h)$ for large n (see TSTM, p.
219). An approximate 95% confidence interval for $\mu$ is then

$$\left(\bar{X}_n - 1.96 v^{1/2}/\sqrt{n},\ \bar{X}_n + 1.96 v^{1/2}/\sqrt{n}\right), \tag{2.4.3}$$

where $v = \sum_{|h|<\infty} \gamma(h)$. Of course, v is not generally known, so it must be estimated
from the data. The estimator computed in the program ITSM is
$\hat{v} = \sum_{|h|<\sqrt{n}} \left(1 - |h|/\sqrt{n}\right) \hat\gamma(h)$. For ARMA processes this is a good approximation to v for large n.


Example 2.4.1 An AR(1) model
Let $\{X_t\}$ be an AR(1) process with mean $\mu$, defined by the equations

$$X_t - \mu = \phi(X_{t-1} - \mu) + Z_t,$$

where $|\phi| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. From Example 2.2.1 we have
$\gamma(h) = \phi^{|h|}\sigma^2/(1 - \phi^2)$ and hence $v = \left(1 + 2\sum_{h=1}^{\infty} \phi^h\right)\sigma^2/(1 - \phi^2) = \sigma^2/(1 - \phi)^2$.
Approximate 95% confidence bounds for $\mu$ are therefore given by $\bar{x}_n \pm 1.96\,\sigma n^{-1/2}/(1 - \phi)$.
Since $\phi$ and $\sigma$ are unknown in practice, they must be replaced in these bounds by
estimated values.
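The AR(1) confidence bounds are simple to evaluate once estimates of $\phi$ and $\sigma$ are available. The sketch below assumes such estimates are supplied by the caller; the function name and the numerical inputs are hypothetical.

```python
import math

def ar1_mean_bounds(xbar, phi, sigma, n):
    """Approximate 95% bounds xbar +/- 1.96*sigma/(sqrt(n)*(1 - phi)) for the mean."""
    half_width = 1.96 * sigma / (math.sqrt(n) * (1.0 - phi))
    return xbar - half_width, xbar + half_width

# Hypothetical inputs: sample mean 0, phi = 0.5, sigma = 1, n = 100
lower, upper = ar1_mean_bounds(0.0, 0.5, 1.0, 100)
```

Setting `phi = 0` gives the familiar iid bounds; positive $\phi$ widens the interval by the factor $1/(1-\phi)$.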


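2.4.2 Estimation of $\gamma(\cdot)$ and $\rho(\cdot)$


Recall from Section 1.4.1 that the sample autocovariance and autocorrelation
functions are defined by

$$\hat\gamma(h) = n^{-1} \sum_{t=1}^{n-|h|} \left(X_{t+|h|} - \bar{X}_n\right)\left(X_t - \bar{X}_n\right) \tag{2.4.4}$$

and

$$\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}. \tag{2.4.5}$$

Both the estimators $\hat\gamma(h)$ and $\hat\rho(h)$ are biased even if the factor $n^{-1}$ in (2.4.4) is replaced
by $(n - h)^{-1}$. Nevertheless, under general assumptions they are nearly unbiased for
large sample sizes. The sample ACVF has the desirable property that for each $k \ge 1$
the k-dimensional sample covariance matrix

$$\hat\Gamma_k = \begin{bmatrix}
\hat\gamma(0) & \hat\gamma(1) & \cdots & \hat\gamma(k-1) \\
\hat\gamma(1) & \hat\gamma(0) & \cdots & \hat\gamma(k-2) \\
\vdots & \vdots & & \vdots \\
\hat\gamma(k-1) & \hat\gamma(k-2) & \cdots & \hat\gamma(0)
\end{bmatrix} \tag{2.4.6}$$

is nonnegative definite. To see this, first note that if $\hat\Gamma_m$ is nonnegative definite, then
$\hat\Gamma_k$ is nonnegative definite for all $k < m$. So assume $k \ge n$ and write

$$\hat\Gamma_k = n^{-1} T T',$$

where T is the $k \times 2k$ matrix

$$T = \begin{bmatrix}
0 & \cdots & 0 & 0 & Y_1 & Y_2 & \cdots & Y_k \\
0 & \cdots & 0 & Y_1 & Y_2 & \cdots & Y_k & 0 \\
\vdots & & & & & & & \vdots \\
0 & Y_1 & Y_2 & \cdots & Y_k & 0 & \cdots & 0
\end{bmatrix},$$

$Y_i = X_i - \bar{X}_n$, $i = 1, \ldots, n$, and $Y_i = 0$ for $i = n + 1, \ldots, k$. Then for any real $k \times 1$
vector a we have

$$a'\hat\Gamma_k a = n^{-1}(a'T)(T'a) \ge 0, \tag{2.4.7}$$

and consequently the sample autocovariance matrix $\hat\Gamma_k$ and sample autocorrelation
matrix

$$\hat{R}_k = \hat\Gamma_k/\hat\gamma(0) \tag{2.4.8}$$

are nonnegative definite. Sometimes the factor $n^{-1}$ is replaced by $(n - h)^{-1}$ in the
definition of $\hat\gamma(h)$, but the resulting covariance and correlation matrices $\hat\Gamma_k$ and $\hat{R}_k$
may not then be nonnegative definite. We shall therefore use the definitions (2.4.4)
and (2.4.5) of $\hat\gamma(h)$ and $\hat\rho(h)$.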
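Definitions (2.4.4) and (2.4.5) translate directly into code. The following minimal sketch uses the divisor n, as in the text; the function names and the toy data are illustrative.

```python
def sample_acvf(x, h):
    """Sample autocovariance (2.4.4) at lag h, with divisor n."""
    n = len(x)
    h = abs(h)
    xbar = sum(x) / n
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n

def sample_acf(x, h):
    """Sample autocorrelation (2.4.5)."""
    return sample_acvf(x, h) / sample_acvf(x, 0)

x = [1.0, 2.0, 3.0, 4.0, 5.0]               # toy data, for illustration only
acf = [sample_acf(x, h) for h in range(len(x))]
```

Because the divisor stays fixed at n while the number of products shrinks with h, the estimates are damped toward zero at large lags, which is the price paid for nonnegative definiteness.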


Remark 1. The matrices $\hat\Gamma_k$ and $\hat{R}_k$ are in fact nonsingular if there is at least one
nonzero $Y_i$, or equivalently if $\hat\gamma(0) > 0$. To establish this result, suppose that $\hat\gamma(0) > 0$
and $\hat\Gamma_k$ is singular. Then there is equality in (2.4.7) for some nonzero vector a, implying
that $a'T = 0$ and hence that the rank of T is less than k. Let $Y_i$ be the first nonzero
value of $Y_1, Y_2, \ldots, Y_k$, and consider the $k \times k$ submatrix of T consisting of columns
$(i + 1)$ through $(i + k)$. Since this matrix is lower right triangular with each diagonal
element equal to $Y_i$, its determinant has absolute value $|Y_i|^k \ne 0$. Consequently, the
submatrix is nonsingular, and T must have rank k, a contradiction.


Without further information beyond the observed data $X_1, \ldots, X_n$, it is
impossible to give reasonable estimates of $\gamma(h)$ and $\rho(h)$ for $h \ge n$. Even for h slightly
smaller than n, the estimates $\hat\gamma(h)$ and $\hat\rho(h)$ are unreliable, since there are so few pairs
$(X_{t+h}, X_t)$ available (only one if $h = n - 1$). A useful guide is provided by Box and
Jenkins (1976), p. 33, who suggest that n should be at least about 50 and $h \le n/4$.

The sample ACF plays an important role in the selection of suitable models for
the data. We have already seen in Example 1.4.6 and Section 1.6 how the sample
ACF can be used to test for iid noise. For systematic inference concerning $\rho(h)$,
we need the sampling distribution of the estimator $\hat\rho(h)$. Although the distribution
of $\hat\rho(h)$ is intractable for samples from even the simplest time series models, it can
usually be well approximated by a normal distribution for large sample sizes. For
linear models and in particular for ARMA models (see Theorem 7.2.2 of TSTM for
exact conditions) $\hat{\boldsymbol\rho}_k = (\hat\rho(1), \ldots, \hat\rho(k))'$ is approximately distributed for large n as
$N(\boldsymbol\rho_k, n^{-1}W)$, i.e.,

$$\hat{\boldsymbol\rho} \approx N\left(\boldsymbol\rho, n^{-1}W\right), \tag{2.4.9}$$

where $\boldsymbol\rho = (\rho(1), \ldots, \rho(k))'$, and W is the covariance matrix whose (i, j) element
is given by Bartlett's formula

$$\begin{aligned}
w_{ij} = \sum_{k=-\infty}^{\infty} \big\{ &\rho(k+i)\rho(k+j) + \rho(k-i)\rho(k+j) + 2\rho(i)\rho(j)\rho^2(k) \\
&- 2\rho(i)\rho(k)\rho(k+j) - 2\rho(j)\rho(k)\rho(k+i) \big\}.
\end{aligned}$$

Simple algebra shows that

$$\begin{aligned}
w_{ij} = \sum_{k=1}^{\infty} &\{\rho(k+i) + \rho(k-i) - 2\rho(i)\rho(k)\} \\
&\times \{\rho(k+j) + \rho(k-j) - 2\rho(j)\rho(k)\},
\end{aligned} \tag{2.4.10}$$

which is a more convenient form of $w_{ij}$ for computational purposes.
Example 2.4.2 iid Noise
If $\{X_t\} \sim \mathrm{IID}(0, \sigma^2)$, then $\rho(h) = 0$ for $|h| > 0$, so from (2.4.10) we obtain

$$w_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

For large n, therefore, $\hat\rho(1), \ldots, \hat\rho(h)$ are approximately independent and identically
distributed normal random variables with mean 0 and variance $n^{-1}$. This result is
the basis for the test that data are generated from iid noise using the sample ACF
described in Section 1.6. (See also Example 1.4.6.)


Example 2.4.3 An MA(1) process
If $\{X_t\}$ is the MA(1) process of Example 1.4.4, i.e., if

$$X_t = Z_t + \theta Z_{t-1}, \quad t = 0, \pm 1, \ldots,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, then from (2.4.10)

$$w_{ii} = \begin{cases} 1 - 3\rho^2(1) + 4\rho^4(1), & \text{if } i = 1, \\ 1 + 2\rho^2(1), & \text{if } i > 1, \end{cases}$$

is the approximate variance of $n^{1/2}(\hat\rho(i) - \rho(i))$ for large n. In Figure 2.1 we have
plotted the sample autocorrelation function $\hat\rho(k)$, $k = 0, \ldots, 40$, for 200 observations
from the MA(1) model

$$X_t = Z_t - .8 Z_{t-1}, \tag{2.4.11}$$

where $\{Z_t\}$ is a sequence of iid N(0, 1) random variables. Here $\rho(1) = -.8/1.64 = -.4878$
and $\rho(h) = 0$ for $h > 1$. The lag-one sample ACF is found to be
$\hat\rho(1) = -.4333 = -6.128 n^{-1/2}$, which would cause us (in the absence of our prior knowledge
of $\{X_t\}$) to reject the hypothesis that the data are a sample from an iid noise sequence.
The fact that $|\hat\rho(h)| \le 1.96 n^{-1/2}$ for $h = 2, \ldots, 40$ strongly suggests that the data are
from a model in which observations are uncorrelated past lag 1. In Figure 2.1 we have
plotted the bounds $\pm 1.96 n^{-1/2}(1 + 2\rho^2(1))^{1/2}$, indicating the compatibility of the data
with the model (2.4.11). Since, however, $\rho(1)$ is not normally known in advance, the
autocorrelations $\hat\rho(2), \ldots, \hat\rho(40)$ would in practice have been compared with the more
stringent bounds $\pm 1.96 n^{-1/2}$ or with the bounds $\pm 1.96 n^{-1/2}(1 + 2\hat\rho^2(1))^{1/2}$ in order to
check the hypothesis that the data are generated by a moving-average process of order
1. Finally, it is worth noting that the lag-one correlation $-.4878$ is well inside the 95%
confidence bounds for $\rho(1)$ given by $\hat\rho(1) \pm 1.96 n^{-1/2}(1 - 3\hat\rho^2(1) + 4\hat\rho^4(1))^{1/2} = -.4333 \pm .1053$.
This further supports the compatibility of the data with the model
$X_t = Z_t - 0.8 Z_{t-1}$.

[Figure 2-1: The sample autocorrelation function of n = 200 observations of the MA(1)
process in Example 2.4.3, showing the bounds $\pm 1.96 n^{-1/2}(1 + 2\rho^2(1))^{1/2}$. Horizontal axis: Lag; vertical axis: ACF.]
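The numerical values quoted in this example can be reproduced from the formula for $w_{11}$. The short sketch below takes n = 200 and $\hat\rho(1) = -.4333$ from the text; the variable names are illustrative.

```python
import math

n = 200
rho1_hat = -0.4333                      # lag-one sample ACF quoted in the text
theta = -0.8                            # MA(1) coefficient of model (2.4.11)

rho1_model = theta / (1.0 + theta ** 2)                       # model ACF at lag 1
w11_hat = 1.0 - 3.0 * rho1_hat ** 2 + 4.0 * rho1_hat ** 4     # Bartlett variance, i = 1
half_width = 1.96 * math.sqrt(w11_hat / n)                    # 95% half-width for rho(1)
inside = abs(rho1_model - rho1_hat) <= half_width             # is -.4878 within the bounds?
```

Rounded to four decimals, `rho1_model` and `half_width` match the figures $-.4878$ and $.1053$ given above.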


Example 2.4.4 An AR(1) process
For the AR(1) process of Example 2.2.1,

$$X_t = \phi X_{t-1} + Z_t,$$

where $\{Z_t\}$ is iid noise and $|\phi| < 1$, we have, from (2.4.10) with $\rho(h) = \phi^{|h|}$,

$$\begin{aligned}
w_{ii} &= \sum_{k=1}^{i} \phi^{2i}\left(\phi^{-k} - \phi^{k}\right)^2 + \sum_{k=i+1}^{\infty} \phi^{2k}\left(\phi^{-i} - \phi^{i}\right)^2 \\
&= \left(1 - \phi^{2i}\right)\left(1 + \phi^2\right)\left(1 - \phi^2\right)^{-1} - 2i\phi^{2i},
\end{aligned} \tag{2.4.12}$$

$i = 1, 2, \ldots$. In Figure 2.2 we have plotted the sample ACF of the Lake Huron
residuals $y_1, \ldots, y_{98}$ from Figure 1.10 together with 95% confidence bounds for
$\rho(i)$, $i = 1, \ldots, 40$, assuming that data are generated from the AR(1) model

$$Y_t = .791 Y_{t-1} + Z_t \tag{2.4.13}$$

(see equation (1.4.3)). The confidence bounds are computed from $\hat\rho(i) \pm 1.96 n^{-1/2} w_{ii}^{1/2}$,
where $w_{ii}$ is given in (2.4.12) with $\phi = .791$. The model ACF, $\rho(i) = (.791)^i$,
is also plotted in Figure 2.2. Notice that the model ACF lies just outside the
confidence bounds at lags 2 to 6. This suggests some incompatibility of the data with the
model (2.4.13). A much better fit to the residuals is provided by the second-order
autoregression defined by (1.4.4).

[Figure 2-2: caption only partly recoverable; it refers to the model ACF $\rho(i) = (.791)^i$.]
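Bartlett's formula in the form (2.4.10) is easy to evaluate by truncating the sum, which gives a way to check the closed form (2.4.12). In the sketch below the function names and the truncation point are assumptions; $\phi = .791$ is the value used in the text.

```python
def bartlett_w(rho, i, j, terms=500):
    """Truncated version of (2.4.10); rho is a function giving the ACF at any integer lag."""
    total = 0.0
    for k in range(1, terms + 1):
        a = rho(k + i) + rho(k - i) - 2.0 * rho(i) * rho(k)
        b = rho(k + j) + rho(k - j) - 2.0 * rho(j) * rho(k)
        total += a * b
    return total

phi = 0.791

def rho_ar1(h):
    """ACF of a causal AR(1) process."""
    return phi ** abs(h)

def w_ii_closed(i):
    """Closed form (2.4.12)."""
    return (1 - phi ** (2 * i)) * (1 + phi ** 2) / (1 - phi ** 2) - 2 * i * phi ** (2 * i)

errors = [abs(bartlett_w(rho_ar1, i, i) - w_ii_closed(i)) for i in range(1, 6)]
```

For i = 1 the closed form reduces to $1 - \phi^2$, which provides a further quick check.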


2.5 Forecasting Stationary Time Series 


We now consider the problem of predicting the values $X_{n+h}$, $h > 0$, of a
stationary time series with known mean $\mu$ and autocovariance function $\gamma$ in terms of the
values $\{X_n, \ldots, X_1\}$, up to time n. Our goal is to find the linear combination of
$1, X_n, X_{n-1}, \ldots, X_1$, that forecasts $X_{n+h}$ with minimum mean squared error. The best
linear predictor in terms of $1, X_n, \ldots, X_1$ will be denoted by $P_n X_{n+h}$ and clearly has
the form

$$P_n X_{n+h} = a_0 + a_1 X_n + \cdots + a_n X_1. \tag{2.5.1}$$

It remains only to determine the coefficients $a_0, a_1, \ldots, a_n$, by finding the values that
minimize

$$S(a_0, \ldots, a_n) = E\left(X_{n+h} - a_0 - a_1 X_n - \cdots - a_n X_1\right)^2. \tag{2.5.2}$$

(We already know from Problem 1.1 that $P_0 Y = E(Y)$.) Since S is a quadratic
function of $a_0, \ldots, a_n$ and is bounded below by zero, it is clear that there is at least
one value of $(a_0, \ldots, a_n)$ that minimizes S and that the minimum $(a_0, \ldots, a_n)$ satisfies
the equations

$$\frac{\partial S(a_0, \ldots, a_n)}{\partial a_j} = 0, \quad j = 0, \ldots, n. \tag{2.5.3}$$

Evaluation of the derivatives in equations (2.5.3) gives the equivalent equations

$$E\left[X_{n+h} - a_0 - \sum_{i=1}^{n} a_i X_{n+1-i}\right] = 0, \tag{2.5.4}$$

$$E\left[\left(X_{n+h} - a_0 - \sum_{i=1}^{n} a_i X_{n+1-i}\right) X_{n+1-j}\right] = 0, \quad j = 1, \ldots, n. \tag{2.5.5}$$

These equations can be written more neatly in vector notation as

$$a_0 = \mu\left(1 - \sum_{i=1}^{n} a_i\right) \tag{2.5.6}$$

and

$$\Gamma_n \mathbf{a}_n = \gamma_n(h), \tag{2.5.7}$$

where

$$\mathbf{a}_n = (a_1, \ldots, a_n)', \quad \Gamma_n = [\gamma(i - j)]_{i,j=1}^{n},$$

and

$$\gamma_n(h) = (\gamma(h), \gamma(h+1), \ldots, \gamma(h+n-1))'.$$

Hence,

$$P_n X_{n+h} = \mu + \sum_{i=1}^{n} a_i (X_{n+1-i} - \mu), \tag{2.5.8}$$

where $\mathbf{a}_n$ satisfies (2.5.7). From (2.5.8) the expected value of the prediction error
$X_{n+h} - P_n X_{n+h}$ is zero, and the mean square prediction error is therefore

$$\begin{aligned}
E(X_{n+h} - P_n X_{n+h})^2 &= \gamma(0) - 2\sum_{i=1}^{n} a_i \gamma(h+i-1) + \sum_{i=1}^{n}\sum_{j=1}^{n} a_i \gamma(i-j) a_j \\
&= \gamma(0) - \mathbf{a}_n' \gamma_n(h),
\end{aligned} \tag{2.5.9}$$

where the last line follows from (2.5.7).


Remark 1. To show that equations (2.5.4) and (2.5.5) determine $P_n X_{n+h}$ uniquely,
let $\{a_j^{(1)}, j = 0, \ldots, n\}$ and $\{a_j^{(2)}, j = 0, \ldots, n\}$ be two solutions and let Z be the
difference between the corresponding predictors, i.e.,

$$Z = a_0^{(1)} - a_0^{(2)} + \sum_{j=1}^{n} \left(a_j^{(1)} - a_j^{(2)}\right) X_{n+1-j}.$$

Then

$$Z^2 = Z\left(a_0^{(1)} - a_0^{(2)} + \sum_{j=1}^{n} \left(a_j^{(1)} - a_j^{(2)}\right) X_{n+1-j}\right).$$

But from (2.5.4) and (2.5.5) we have $EZ = 0$ and $E(Z X_{n+1-j}) = 0$ for $j = 1, \ldots, n$.
Consequently, $E(Z^2) = 0$ and hence $Z = 0$.


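Properties of $P_n X_{n+h}$:

1. $P_n X_{n+h} = \mu + \sum_{i=1}^{n} a_i (X_{n+1-i} - \mu)$, where $\mathbf{a}_n = (a_1, \ldots, a_n)'$ satisfies (2.5.7).

2. $E(X_{n+h} - P_n X_{n+h})^2 = \gamma(0) - \mathbf{a}_n' \gamma_n(h)$, where $\gamma_n(h) = (\gamma(h), \ldots, \gamma(h+n-1))'$.

3. $E(X_{n+h} - P_n X_{n+h}) = 0$.

4. $E[(X_{n+h} - P_n X_{n+h}) X_j] = 0$, $j = 1, \ldots, n$.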


Remark 2. Notice that properties 3 and 4 are exactly equivalent to (2.5.4) and
(2.5.5). They can be written more succinctly in the form

$$E[(\text{Error}) \times (\text{PredictorVariable})] = 0. \tag{2.5.10}$$

Equations (2.5.10), one for each predictor variable, therefore uniquely determine
$P_n X_{n+h}$.


Example 2.5.1 One-step prediction of an AR(1) series


Consider now the stationary time series defined in Example 2.2.1 by the equations

$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots,$$

where $|\phi| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. From (2.5.7) and (2.5.8), the best linear
predictor of $X_{n+1}$ in terms of $\{1, X_n, \ldots, X_1\}$ is (for $n \ge 1$)

$$P_n X_{n+1} = \mathbf{a}_n' \mathbf{X}_n,$$

where $\mathbf{X}_n = (X_n, \ldots, X_1)'$ and

$$\begin{bmatrix}
1 & \phi & \phi^2 & \cdots & \phi^{n-1} \\
\phi & 1 & \phi & \cdots & \phi^{n-2} \\
\vdots & \vdots & \vdots & & \vdots \\
\phi^{n-1} & \phi^{n-2} & \phi^{n-3} & \cdots & 1
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}
=
\begin{bmatrix} \phi \\ \phi^2 \\ \vdots \\ \phi^n \end{bmatrix}. \tag{2.5.11}$$

A solution of (2.5.11) is clearly

$$\mathbf{a}_n = (\phi, 0, \ldots, 0)',$$

and hence the best linear predictor of $X_{n+1}$ in terms of $\{X_1, \ldots, X_n\}$ is

$$P_n X_{n+1} = \mathbf{a}_n' \mathbf{X}_n = \phi X_n,$$

with mean squared error

$$E(X_{n+1} - P_n X_{n+1})^2 = \gamma(0) - \mathbf{a}_n' \gamma_n(1) = \frac{\sigma^2}{1 - \phi^2} - \phi\gamma(1) = \sigma^2.$$

A simpler approach to this problem is to guess, by inspection of the equation defining
$X_{n+1}$, that the best predictor is $\phi X_n$. Then to verify this conjecture, it suffices to check
(2.5.10) for each of the predictor variables $1, X_n, \ldots, X_1$. The prediction error of the
predictor $\phi X_n$ is clearly $X_{n+1} - \phi X_n = Z_{n+1}$. But $E(Z_{n+1} Y) = 0$ for $Y = 1$ and for
$Y = X_j$, $j = 1, \ldots, n$. Hence, by (2.5.10), $\phi X_n$ is the required best linear predictor
in terms of $1, X_1, \ldots, X_n$.
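The prediction equations (2.5.7) can also be solved numerically, which gives a direct check of this example. The sketch below uses a small Gaussian elimination routine written for the illustration; `solve_linear`, `best_linear_predictor`, and the parameter values are assumptions, not part of the text or of ITSM.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def best_linear_predictor(gamma, n, h):
    """Return (a_n, mse) from Gamma_n a_n = gamma_n(h) and (2.5.9)."""
    Gamma = [[gamma(i - j) for j in range(n)] for i in range(n)]
    g = [gamma(h + i) for i in range(n)]
    a = solve_linear(Gamma, g)
    mse = gamma(0) - sum(ai * gi for ai, gi in zip(a, g))
    return a, mse

phi, sigma2 = 0.6, 1.0                                 # hypothetical AR(1) parameters

def gamma_ar1(k):
    """AR(1) autocovariance function."""
    return sigma2 * phi ** abs(k) / (1 - phi ** 2)

a, mse = best_linear_predictor(gamma_ar1, 5, 1)
```

For the AR(1) autocovariance the computed coefficient vector should be numerically equal to $(\phi, 0, \ldots, 0)'$ and the mean squared error to $\sigma^2$.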


Prediction of Second-Order Random Variables 

Suppose now that Y and $W_n, \ldots, W_1$ are any random variables with finite second
moments and that the means $\mu_Y = EY$, $\mu_i = EW_i$ and covariances Cov(Y, Y),
Cov(Y, $W_i$), and Cov($W_i$, $W_j$) are all known. It is convenient to introduce the random
vector $\mathbf{W} = (W_n, \ldots, W_1)'$, the corresponding vector of means $\mu_W = (\mu_n, \ldots, \mu_1)'$,
the vector of covariances

$$\gamma = \mathrm{Cov}(Y, \mathbf{W}) = (\mathrm{Cov}(Y, W_n), \mathrm{Cov}(Y, W_{n-1}), \ldots, \mathrm{Cov}(Y, W_1))',$$

and the covariance matrix

$$\Gamma = \mathrm{Cov}(\mathbf{W}, \mathbf{W}) = \left[\mathrm{Cov}(W_{n+1-i}, W_{n+1-j})\right]_{i,j=1}^{n}.$$

Then by the same arguments used in the calculation of $P_n X_{n+h}$, the best linear
predictor of Y in terms of $\{1, W_n, \ldots, W_1\}$ is found to be

$$P(Y|\mathbf{W}) = \mu_Y + \mathbf{a}'(\mathbf{W} - \mu_W), \tag{2.5.12}$$

where $\mathbf{a} = (a_1, \ldots, a_n)'$ is any solution of

$$\Gamma \mathbf{a} = \gamma. \tag{2.5.13}$$

The mean squared error of the predictor is

$$E\left[(Y - P(Y|\mathbf{W}))^2\right] = \mathrm{Var}(Y) - \mathbf{a}'\gamma. \tag{2.5.14}$$
Example 2.5.2 Estimation of a missing value 


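Consider again the stationary series defined in Example 2.2.1 by the equations

$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots,$$

where $|\phi| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Suppose that we observe the series at times
1 and 3 and wish to use these observations to find the linear combination of 1, $X_1$,
and $X_3$ that estimates $X_2$ with minimum mean squared error. The solution to this
problem can be obtained directly from (2.5.12) and (2.5.13) by setting $Y = X_2$ and
$\mathbf{W} = (X_1, X_3)'$. This gives the equations (after cancelling the common factor $\sigma^2/(1 - \phi^2)$)

$$\begin{bmatrix} 1 & \phi^2 \\ \phi^2 & 1 \end{bmatrix} \mathbf{a} = \begin{bmatrix} \phi \\ \phi \end{bmatrix},$$

with solution

$$\mathbf{a} = \frac{1}{1 + \phi^2} \begin{bmatrix} \phi \\ \phi \end{bmatrix}.$$

The best estimator of $X_2$ is thus

$$P(X_2|\mathbf{W}) = \frac{\phi}{1 + \phi^2}(X_1 + X_3),$$

with mean squared error

$$E\left[(X_2 - P(X_2|\mathbf{W}))^2\right] = \frac{\sigma^2}{1 - \phi^2} - \mathbf{a}' \begin{bmatrix} \dfrac{\phi\sigma^2}{1 - \phi^2} \\[2mm] \dfrac{\phi\sigma^2}{1 - \phi^2} \end{bmatrix} = \frac{\sigma^2}{1 + \phi^2}.$$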
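The missing-value calculation amounts to solving a 2 x 2 system, which can be verified numerically. The sketch below solves the system with the common factor $\sigma^2/(1-\phi^2)$ cancelled and then applies (2.5.14); the function names and the value $\phi = 0.5$ are illustrative assumptions.

```python
def missing_value_weights(phi):
    """Solve [[1, phi^2], [phi^2, 1]] a = [phi, phi] by Cramer's rule."""
    det = 1.0 - phi ** 4
    a1 = (phi - phi ** 3) / det
    a2 = (phi - phi ** 3) / det
    return a1, a2

def missing_value_mse(phi, sigma2):
    """Var(X_2) - a' Cov(X_2, W), with Cov(X_2, X_1) = Cov(X_2, X_3) = phi * gamma(0)."""
    a1, a2 = missing_value_weights(phi)
    gamma0 = sigma2 / (1.0 - phi ** 2)
    return gamma0 * (1.0 - (a1 + a2) * phi)

a1, a2 = missing_value_weights(0.5)
mse = missing_value_mse(0.5, 1.0)
```

Both weights should equal $\phi/(1+\phi^2)$ and the mean squared error $\sigma^2/(1+\phi^2)$, which is smaller than the one-step prediction error $\sigma^2$ because observations on both sides of the gap are used.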
The Prediction Operator $P(\cdot|\mathbf{W})$
For any given $\mathbf{W} = (W_n, \ldots, W_1)'$ and Y with finite second moments, we have seen
how to compute the best linear predictor $P(Y|\mathbf{W})$ of Y in terms of $1, W_n, \ldots, W_1$
from (2.5.12) and (2.5.13). The function $P(\cdot|\mathbf{W})$, which converts Y into $P(Y|\mathbf{W})$,
is called a prediction operator. (The operator $P_n$ defined by equations (2.5.7) and
(2.5.8) is an example with $\mathbf{W} = (X_n, X_{n-1}, \ldots, X_1)'$.) Prediction operators have a
number of useful properties that can sometimes be used to simplify the calculation
of best linear predictors. We list some of these below.


Properties of the Prediction Operator $P(\cdot|\mathbf{W})$:


Suppose that $EU^2 < \infty$, $EV^2 < \infty$, $\Gamma = \mathrm{cov}(\mathbf{W}, \mathbf{W})$, and $\beta, \alpha_1, \ldots, \alpha_n$ are
constants.


1. $P(U|\mathbf{W}) = EU + \mathbf{a}'(\mathbf{W} - E\mathbf{W})$, where $\Gamma\mathbf{a} = \mathrm{cov}(U, \mathbf{W})$.

2. $E[(U - P(U|\mathbf{W}))\mathbf{W}] = 0$ and $E[U - P(U|\mathbf{W})] = 0$.

3. $E[(U - P(U|\mathbf{W}))^2] = \mathrm{var}(U) - \mathbf{a}'\mathrm{cov}(U, \mathbf{W})$.

4. $P(\alpha_1 U + \alpha_2 V + \beta|\mathbf{W}) = \alpha_1 P(U|\mathbf{W}) + \alpha_2 P(V|\mathbf{W}) + \beta$.

5. $P\left(\sum_{i=1}^{n} \alpha_i W_i + \beta \,\middle|\, \mathbf{W}\right) = \sum_{i=1}^{n} \alpha_i W_i + \beta$.

6. $P(U|\mathbf{W}) = EU$ if $\mathrm{cov}(U, \mathbf{W}) = 0$.

7. $P(U|\mathbf{W}) = P(P(U|\mathbf{W}, \mathbf{V})|\mathbf{W})$ if $\mathbf{V}$ is a random vector such that the
components of $E(\mathbf{V}\mathbf{V}')$ are all finite.

Example 2.5.3 One-step prediction of an AR(p) series
Suppose now that $\{X_t\}$ is a stationary time series satisfying the equations

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t, \quad t = 0, \pm 1, \ldots,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. Then if
$n > p$, we can apply the prediction operator $P_n$ to each side of the defining equations,
using properties (4), (5), and (6) to get

$$P_n X_{n+1} = \phi_1 X_n + \cdots + \phi_p X_{n+1-p}.$$


Example 2.5.4 An AR(1) series with nonzero mean


The time series $\{Y_t\}$ is said to be an AR(1) process with mean $\mu$ if $\{X_t = Y_t - \mu\}$ is a
zero-mean AR(1) process. Defining $\{X_t\}$ as in Example 2.5.1 and letting $Y_t = X_t + \mu$,
we see that $Y_t$ satisfies the equation

$$Y_t - \mu = \phi(Y_{t-1} - \mu) + Z_t. \tag{2.5.15}$$

If $P_n Y_{n+h}$ is the best linear predictor of $Y_{n+h}$ in terms of $\{1, Y_n, \ldots, Y_1\}$, then
application of $P_n$ to (2.5.15) with $t = n + 1, n + 2, \ldots$ gives the recursions

$$P_n Y_{n+h} - \mu = \phi(P_n Y_{n+h-1} - \mu), \quad h = 1, 2, \ldots.$$

Noting that $P_n Y_n = Y_n$, we can solve these equations recursively for $P_n Y_{n+h}$,
$h = 1, 2, \ldots$, to obtain

$$P_n Y_{n+h} = \mu + \phi^h (Y_n - \mu). \tag{2.5.16}$$

The corresponding mean squared error is (from (2.5.14))

$$E(Y_{n+h} - P_n Y_{n+h})^2 = \gamma(0)\left[1 - \mathbf{a}_n' \rho_n(h)\right]. \tag{2.5.17}$$

From Example 2.2.1 we know that $\gamma(0) = \sigma^2/(1 - \phi^2)$ and $\rho(h) = \phi^h$, $h \ge 0$.
Hence, substituting $\mathbf{a}_n = (\phi^h, 0, \ldots, 0)'$ (from (2.5.16)) into (2.5.17) gives

$$E(Y_{n+h} - P_n Y_{n+h})^2 = \sigma^2\left(1 - \phi^{2h}\right)/\left(1 - \phi^2\right). \tag{2.5.18}$$
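Formulas (2.5.16) and (2.5.18) give the h-step forecast and its mean squared error in closed form. A minimal sketch follows; the function name and the input values are hypothetical.

```python
def ar1_forecast(y_n, mu, phi, sigma2, h):
    """h-step predictor (2.5.16) and its mean squared error (2.5.18) for an AR(1) with mean mu."""
    pred = mu + phi ** h * (y_n - mu)
    mse = sigma2 * (1.0 - phi ** (2 * h)) / (1.0 - phi ** 2)
    return pred, mse

# Hypothetical values: last observation 12, mean 10, phi = 0.5, sigma^2 = 1
pred1, mse1 = ar1_forecast(12.0, 10.0, 0.5, 1.0, 1)
pred2, mse2 = ar1_forecast(12.0, 10.0, 0.5, 1.0, 2)
```

As h grows the forecast decays geometrically toward $\mu$ and the mean squared error increases toward $\gamma(0) = \sigma^2/(1-\phi^2)$.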


Remark 3. In general, if $\{Y_t\}$ is a stationary time series with mean $\mu$ and if $\{X_t\}$ is
the zero-mean series defined by $X_t = Y_t - \mu$, then since the collection of all linear
combinations of $1, Y_n, \ldots, Y_1$ is the same as the collection of all linear combinations of
$1, X_n, \ldots, X_1$, the linear predictor of any random variable W in terms of $1, Y_n, \ldots, Y_1$
is the same as the linear predictor in terms of $1, X_n, \ldots, X_1$. Denoting this predictor
by $P_n W$ and applying $P_n$ to the equation $Y_{n+h} = X_{n+h} + \mu$ gives

$$P_n Y_{n+h} = \mu + P_n X_{n+h}. \tag{2.5.19}$$

Thus the best linear predictor of $Y_{n+h}$ can be determined by finding the best linear
predictor of $X_{n+h}$ and then adding $\mu$. Note from (2.5.8) that since $E(X_t) = 0$, $P_n X_{n+h}$
is the same as the best linear predictor of $X_{n+h}$ in terms of $X_n, \ldots, X_1$ only.


2.5.1 The Durbin-Levinson Algorithm 


In view of Remark 3 above, we can restrict attention from now on to zero-mean
stationary time series, making the necessary adjustments for the mean if we wish
to predict a stationary series with nonzero mean. If $\{X_t\}$ is a zero-mean stationary
series with autocovariance function $\gamma(\cdot)$, then in principle the equations (2.5.12)
and (2.5.13) completely solve the problem of determining the best linear predictor
$P_n X_{n+h}$ of $X_{n+h}$ in terms of $\{X_n, \ldots, X_1\}$. However, the direct approach requires the
determination of a solution of a system of n linear equations, which for large n may
be difficult and time-consuming. In cases where the process is defined by a system
of linear equations (as in Examples 2.5.2 and 2.5.3) we have seen how the linearity
of $P_n$ can be used to great advantage. For more general stationary processes it would
be helpful if the one-step predictor $P_n X_{n+1}$ based on n previous observations could
be used to simplify the calculation of $P_{n+1} X_{n+2}$, the one-step predictor based on
n + 1 previous observations. Prediction algorithms that utilize this idea are said to be
recursive. Two important examples are the Durbin-Levinson algorithm, discussed
in this section, and the innovations algorithm, discussed in Section 2.5.2 below.
We know from (2.5.12) and (2.5.13) that if the matrix $\Gamma_n$ is nonsingular, then

$$P_n X_{n+1} = \boldsymbol\phi_n' \mathbf{X}_n = \phi_{n1} X_n + \cdots + \phi_{nn} X_1,$$

where

$$\boldsymbol\phi_n = \Gamma_n^{-1} \gamma_n,$$

$\gamma_n = (\gamma(1), \ldots, \gamma(n))'$, and the corresponding mean squared error is

$$v_n := E(X_{n+1} - P_n X_{n+1})^2 = \gamma(0) - \boldsymbol\phi_n' \gamma_n.$$

A useful sufficient condition for nonsingularity of all the autocovariance matrices
$\Gamma_1, \Gamma_2, \ldots$ is $\gamma(0) > 0$ and $\gamma(h) \to 0$ as $h \to \infty$. (For a proof of this result see
TSTM, Proposition 5.1.1.)


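The Durbin-Levinson Algorithm:


The coefficients $\phi_{n1}, \ldots, \phi_{nn}$ can be computed recursively from the equations

$$\phi_{nn} = \left[\gamma(n) - \sum_{j=1}^{n-1} \phi_{n-1,j}\,\gamma(n - j)\right] v_{n-1}^{-1}, \tag{2.5.20}$$

$$\begin{bmatrix} \phi_{n1} \\ \vdots \\ \phi_{n,n-1} \end{bmatrix}
= \begin{bmatrix} \phi_{n-1,1} \\ \vdots \\ \phi_{n-1,n-1} \end{bmatrix}
- \phi_{nn} \begin{bmatrix} \phi_{n-1,n-1} \\ \vdots \\ \phi_{n-1,1} \end{bmatrix} \tag{2.5.21}$$

and

$$v_n = v_{n-1}\left[1 - \phi_{nn}^2\right], \tag{2.5.22}$$

where $\phi_{11} = \gamma(1)/\gamma(0)$ and $v_0 = \gamma(0)$.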


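Proof. The definition of $\phi_{11}$ ensures that the equation

$$R_n \boldsymbol{\phi}_n = \boldsymbol{\rho}_n \qquad (2.5.23)$$

(where $\boldsymbol{\rho}_n = (\rho(1), \ldots, \rho(n))'$) is satisfied for $n = 1$. The first step in the proof is to
show that $\boldsymbol{\phi}_n$, defined recursively by (2.5.20) and (2.5.21), satisfies (2.5.23) for all $n$.
Suppose this is true for $n = k$. Then, partitioning $R_{k+1}$ and defining

$$\boldsymbol{\rho}_k^{(r)} := (\rho(k), \rho(k-1), \ldots, \rho(1))'$$

and

$$\boldsymbol{\phi}_k^{(r)} := (\phi_{kk}, \phi_{k,k-1}, \ldots, \phi_{k1})',$$

we see that the recursions imply

$$R_{k+1} \boldsymbol{\phi}_{k+1} = \begin{bmatrix} R_k & \boldsymbol{\rho}_k^{(r)} \\ \boldsymbol{\rho}_k^{(r)\prime} & 1 \end{bmatrix} \begin{bmatrix} \boldsymbol{\phi}_k - \phi_{k+1,k+1} \boldsymbol{\phi}_k^{(r)} \\ \phi_{k+1,k+1} \end{bmatrix}$$

$$= \begin{bmatrix} \boldsymbol{\rho}_k - \phi_{k+1,k+1} \boldsymbol{\rho}_k^{(r)} + \phi_{k+1,k+1} \boldsymbol{\rho}_k^{(r)} \\ \boldsymbol{\rho}_k^{(r)\prime} \boldsymbol{\phi}_k - \phi_{k+1,k+1} \boldsymbol{\rho}_k^{(r)\prime} \boldsymbol{\phi}_k^{(r)} + \phi_{k+1,k+1} \end{bmatrix} = \boldsymbol{\rho}_{k+1},$$

as required. Here we have used the fact that if $R_k \boldsymbol{\phi}_k = \boldsymbol{\rho}_k$, then $R_k \boldsymbol{\phi}_k^{(r)} = \boldsymbol{\rho}_k^{(r)}$. This
is easily checked by writing out the component equations in reverse order. Since
(2.5.23) is satisfied for $n = 1$, it follows by induction that the coefficient vectors $\boldsymbol{\phi}_n$
defined recursively by (2.5.20) and (2.5.21) satisfy (2.5.23) for all $n$.

It remains only to establish that the mean squared errors

$$v_n := E(X_{n+1} - \boldsymbol{\phi}_n' \mathbf{X}_n)^2$$

satisfy $v_0 = \gamma(0)$ and (2.5.22). The fact that $v_0 = \gamma(0)$ is an immediate consequence
of the definition $P_0 X_1 := E(X_1) = 0$. Since we have shown that $\boldsymbol{\phi}_n' \mathbf{X}_n$ is the best
linear predictor of $X_{n+1}$, we can write, from (2.5.9) and (2.5.21),

$$v_n = \gamma(0) - \boldsymbol{\phi}_n' \boldsymbol{\gamma}_n = \gamma(0) - \boldsymbol{\phi}_{n-1}' \boldsymbol{\gamma}_{n-1} + \phi_{nn} \boldsymbol{\phi}_{n-1}^{(r)\prime} \boldsymbol{\gamma}_{n-1} - \phi_{nn} \gamma(n).$$

Applying (2.5.9) again gives

$$v_n = v_{n-1} + \phi_{nn} \left( \boldsymbol{\phi}_{n-1}^{(r)\prime} \boldsymbol{\gamma}_{n-1} - \gamma(n) \right),$$

and hence, by (2.5.20),

$$v_n = v_{n-1} - \phi_{nn}^2 \left( \gamma(0) - \boldsymbol{\phi}_{n-1}' \boldsymbol{\gamma}_{n-1} \right) = v_{n-1} \left( 1 - \phi_{nn}^2 \right). \qquad \blacksquare$$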


Remark 4. Under the conditions of the proposition, the function defined by $\alpha(0) = 1$
and $\alpha(n) = \phi_{nn}$, $n = 1, 2, \ldots$, is known as the partial autocorrelation function
(PACF) of $\{X_t\}$. It will be discussed further in Section 3.2. Of particular interest is
equation (2.5.22), which shows the relation between $\alpha(n)$ and the reduction in the
one-step mean squared error as the number of predictors is increased from $n - 1$
to $n$.
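
The recursions (2.5.20)-(2.5.22) translate almost line by line into code. The following Python sketch is illustrative only: the function name `durbin_levinson`, the helper `gamma_ma1`, and the convention of passing the autocovariance function as a callable are assumptions made for this example and are not part of ITSM. The usage lines apply it to an MA(1) series with coefficient -0.9 and unit white noise variance, for which $\gamma(0) = 1.81$, $\gamma(1) = -0.9$, and $\gamma(h) = 0$ for $h > 1$.

```python
def durbin_levinson(gamma, n):
    """Run the Durbin-Levinson recursions up to order n.

    gamma: callable, gamma(h) is the autocovariance at lag h >= 0.
    Returns (phi, v) where phi[m] = [phi_m1, ..., phi_mm] and v[m] = v_m.
    """
    v = [gamma(0)]   # v_0 = gamma(0)
    phi = [[]]       # no coefficients when there are no observations
    for m in range(1, n + 1):
        prev = phi[m - 1]
        # (2.5.20): phi_mm = [gamma(m) - sum_j phi_{m-1,j} gamma(m-j)] / v_{m-1}
        acc = gamma(m) - sum(prev[j - 1] * gamma(m - j) for j in range(1, m))
        phi_mm = acc / v[m - 1]
        # (2.5.21): phi_mj = phi_{m-1,j} - phi_mm * phi_{m-1,m-j}, j = 1..m-1
        row = [prev[j - 1] - phi_mm * prev[m - j - 1] for j in range(1, m)]
        row.append(phi_mm)
        phi.append(row)
        # (2.5.22): v_m = v_{m-1} * (1 - phi_mm^2)
        v.append(v[m - 1] * (1.0 - phi_mm ** 2))
    return phi, v


def gamma_ma1(h):
    """ACVF of X_t = Z_t - 0.9 Z_{t-1} with Var(Z_t) = 1 (illustrative)."""
    h = abs(h)
    if h == 0:
        return 1.81
    return -0.9 if h == 1 else 0.0


phi, v = durbin_levinson(gamma_ma1, 4)
pacf = [1.0] + [phi[m][m - 1] for m in range(1, 5)]  # alpha(n) = phi_nn
print(phi[2], v[2])  # approximately [-0.6606, -0.3285] and 1.2155
```

The list `pacf` collects the values $\alpha(n) = \phi_{nn}$ of Remark 4, so the same loop yields the partial autocorrelation function as a by-product.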


2.5.2 The Innovations Algorithm 


The recursive algorithm to be discussed in this section is applicable to all series 
with finite second moments, regardless of whether they are stationary or not. Its 
application, however, can be simplified in certain special cases. 

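Suppose then that $\{X_t\}$ is a zero-mean series with $E|X_t|^2 < \infty$ for each $t$ and

$$E(X_i X_j) = \kappa(i, j). \qquad (2.5.24)$$

It will be convenient to introduce the following notation for the best one-step
predictors and their mean squared errors:

$$\hat X_n = \begin{cases} 0, & \text{if } n = 1, \\ P_{n-1} X_n, & \text{if } n = 2, 3, \ldots, \end{cases}$$

and

$$v_n = E(X_{n+1} - P_n X_{n+1})^2.$$

We shall also introduce the innovations, or one-step prediction errors,

$$U_n = X_n - \hat X_n.$$

In terms of the vectors $\mathbf{U}_n = (U_1, \ldots, U_n)'$ and $\mathbf{X}_n = (X_1, \ldots, X_n)'$ the last equations
can be written as

$$\mathbf{U}_n = A_n \mathbf{X}_n, \qquad (2.5.25)$$

where $A_n$ has the form

$$A_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ a_{11} & 1 & 0 & \cdots & 0 \\ a_{22} & a_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & 0 \\ a_{n-1,n-1} & a_{n-1,n-2} & a_{n-1,n-3} & \cdots & 1 \end{bmatrix}.$$

(If $\{X_t\}$ is stationary, then $a_{ij} = -a_j$ with $a_j$ as in (2.5.7) with $h = 1$.) This implies
that $A_n$ is nonsingular, with inverse $C_n$ of the form

$$C_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ \theta_{11} & 1 & 0 & \cdots & 0 \\ \theta_{22} & \theta_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & 0 \\ \theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \cdots & 1 \end{bmatrix}.$$

The vector of one-step predictors $\hat{\mathbf{X}}_n := (\hat X_1, P_1 X_2, \ldots, P_{n-1} X_n)'$ can therefore be
expressed as

$$\hat{\mathbf{X}}_n = \mathbf{X}_n - \mathbf{U}_n = C_n \mathbf{U}_n - \mathbf{U}_n = \Theta_n \left( \mathbf{X}_n - \hat{\mathbf{X}}_n \right), \qquad (2.5.26)$$

where

$$\Theta_n = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ \theta_{11} & 0 & 0 & \cdots & 0 \\ \theta_{22} & \theta_{21} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & 0 \\ \theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \cdots & 0 \end{bmatrix}$$

and $\mathbf{X}_n$ itself satisfies

$$\mathbf{X}_n = C_n \left( \mathbf{X}_n - \hat{\mathbf{X}}_n \right). \qquad (2.5.27)$$

Equation (2.5.26) can be rewritten as

$$\hat X_{n+1} = \begin{cases} 0, & \text{if } n = 0, \\ \displaystyle\sum_{j=1}^{n} \theta_{nj} \left( X_{n+1-j} - \hat X_{n+1-j} \right), & \text{if } n = 1, 2, \ldots, \end{cases} \qquad (2.5.28)$$

from which the one-step predictors $\hat X_1, \hat X_2, \ldots$ can be computed recursively once
the coefficients $\theta_{ij}$ have been determined. The following algorithm generates these
coefficients and the mean squared errors $v_i = E(X_{i+1} - \hat X_{i+1})^2$, starting from the
covariances $\kappa(i, j)$.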

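The Innovations Algorithm:

The coefficients $\theta_{n1}, \ldots, \theta_{nn}$ can be computed recursively from the equations

$$v_0 = \kappa(1, 1),$$

$$\theta_{n,n-k} = v_k^{-1} \left( \kappa(n+1, k+1) - \sum_{j=0}^{k-1} \theta_{k,k-j}\, \theta_{n,n-j}\, v_j \right), \qquad 0 \le k < n,$$

and

$$v_n = \kappa(n+1, n+1) - \sum_{j=0}^{n-1} \theta_{n,n-j}^2\, v_j.$$

(It is a trivial matter to solve first for $v_0$, then successively for $\theta_{11}, v_1$; $\theta_{22}, \theta_{21}, v_2$;
$\theta_{33}, \theta_{32}, \theta_{31}, v_3$; $\ldots$.)

Proof. See TSTM, Proposition 5.2.2. $\blacksquare$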
Remark 5. While the Durbin-Levinson recursion gives the coefficients of
$X_n, \ldots, X_1$ in the representation $\hat X_{n+1} = \sum_{j=1}^{n} \phi_{nj} X_{n+1-j}$, the innovations
algorithm gives the coefficients of $(X_n - \hat X_n), \ldots, (X_1 - \hat X_1)$ in the expansion
$\hat X_{n+1} = \sum_{j=1}^{n} \theta_{nj} (X_{n+1-j} - \hat X_{n+1-j})$. The latter expansion has a number of advantages
deriving from the fact that the innovations are uncorrelated (see Problem 2.20). It can also
be greatly simplified in the case of ARMA($p$, $q$) series, as we shall see in Section
3.3. An immediate consequence of (2.5.28) is the innovations representation of $X_{n+1}$
itself. Thus (defining $\theta_{n0} := 1$),

$$X_{n+1} = X_{n+1} - \hat X_{n+1} + \hat X_{n+1} = \sum_{j=0}^{n} \theta_{nj} \left( X_{n+1-j} - \hat X_{n+1-j} \right), \qquad n = 0, 1, 2, \ldots.$$

Example 2.5.5 Recursive prediction of an MA(1)
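If $\{X_t\}$ is the time series defined by

$$X_t = Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

then $\kappa(i, j) = 0$ for $|i - j| > 1$, $\kappa(i, i) = \sigma^2(1 + \theta^2)$, and $\kappa(i, i+1) = \theta\sigma^2$.
Application of the innovations algorithm leads at once to the recursions

$$\theta_{nj} = 0, \qquad 2 \le j \le n,$$

$$\theta_{n1} = v_{n-1}^{-1}\, \theta \sigma^2,$$

$$v_0 = (1 + \theta^2)\sigma^2,$$

and

$$v_n = \left[ 1 + \theta^2 - v_{n-1}^{-1}\, \theta^2 \sigma^2 \right] \sigma^2.$$

For the particular case

$$X_t = Z_t - 0.9 Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, 1),$$

the mean squared errors $v_n$ of $\hat X_{n+1}$ and coefficients $\theta_{nj}$, $1 \le j \le n$, in the innovations
representation

$$\hat X_{n+1} = \sum_{j=1}^{n} \theta_{nj} \left( X_{n+1-j} - \hat X_{n+1-j} \right) = \theta_{n1} \left( X_n - \hat X_n \right)$$

are found from the recursions to be as follows:

$v_0 = 1.8100$,
$\theta_{11} = -.4972$, $v_1 = 1.3625$,
$\theta_{21} = -.6606$, $\theta_{22} = 0$, $v_2 = 1.2155$,
$\theta_{31} = -.7404$, $\theta_{32} = 0$, $\theta_{33} = 0$, $v_3 = 1.1436$,
$\theta_{41} = -.7870$, $\theta_{42} = 0$, $\theta_{43} = 0$, $\theta_{44} = 0$, $v_4 = 1.1017$.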

If we apply the Durbin-Levinson algorithm to the same problem, we find that the
mean squared errors $v_n$ of $\hat X_{n+1}$ and coefficients $\phi_{nj}$, $1 \le j \le n$, in the representation

$$\hat X_{n+1} = \sum_{j=1}^{n} \phi_{nj} X_{n+1-j}$$

are as follows:

$v_0 = 1.8100$,
$\phi_{11} = -.4972$, $v_1 = 1.3625$,
$\phi_{21} = -.6606$, $\phi_{22} = -.3285$, $v_2 = 1.2155$,
$\phi_{31} = -.7404$, $\phi_{32} = -.4892$, $\phi_{33} = -.2433$, $v_3 = 1.1436$,
$\phi_{41} = -.7870$, $\phi_{42} = -.5828$, $\phi_{43} = -.3850$, $\phi_{44} = -.1914$, $v_4 = 1.1017$.

Notice that as $n$ increases, $v_n$ approaches the white noise variance and $\theta_{n1}$ approaches
$\theta$. These results hold for any MA(1) process with $|\theta| < 1$. The innovations algorithm
is particularly well suited to forecasting MA($q$) processes, since for them $\theta_{nj} = 0$
for $j > q$ (as the zero entries $\theta_{n2}, \theta_{n3}, \ldots$ in the table above illustrate for $q = 1$).
For AR($p$) processes the Durbin-Levinson algorithm is usually more
convenient, since $\phi_{nj} = 0$ for $j > p$ once $n \ge p$.
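
The innovations recursions and the predictor formula (2.5.28) can be sketched in the same way. The Python below is a minimal illustration under stated assumptions: the names `innovations`, `one_step_predictors`, and `kappa_ma1` are hypothetical, the covariance is supplied as a callable `kappa(i, j)` with 1-based arguments as in (2.5.24), and the two-point data vector is invented purely to exercise the code. It is applied to the MA(1) series $X_t = Z_t - 0.9 Z_{t-1}$ with unit white noise variance.

```python
def innovations(kappa, n):
    """Innovations algorithm up to order n.

    kappa: callable, kappa(i, j) = E(X_i X_j) with 1-based i, j.
    Returns (theta, v) with theta[m][j] = theta_{m,j} for 1 <= j <= m
    (index 0 of each row is unused) and v[m] = v_m.
    """
    v = [kappa(1, 1)]            # v_0 = kappa(1, 1)
    theta = [[0.0]]              # placeholder row for m = 0
    for m in range(1, n + 1):
        row = [0.0] * (m + 1)
        for k in range(m):       # computes theta_{m,m-k} for k = 0, ..., m-1
            s = sum(theta[k][k - j] * row[m - j] * v[j] for j in range(k))
            row[m - k] = (kappa(m + 1, k + 1) - s) / v[k]
        theta.append(row)
        v.append(kappa(m + 1, m + 1)
                 - sum(row[m - j] ** 2 * v[j] for j in range(m)))
    return theta, v


def one_step_predictors(x, theta):
    """Apply (2.5.28): x[0] is X_1; the result's entry m is X-hat_{m+1}."""
    xhat = [0.0]                 # X-hat_1 = 0
    for m in range(1, len(x)):
        xhat.append(sum(theta[m][j] * (x[m - j] - xhat[m - j])
                        for j in range(1, m + 1)))
    return xhat


def kappa_ma1(i, j):
    """Covariances of X_t = Z_t - 0.9 Z_{t-1}, Var(Z_t) = 1 (illustrative)."""
    d = abs(i - j)
    if d == 0:
        return 1.81
    return -0.9 if d == 1 else 0.0


theta, v = innovations(kappa_ma1, 4)
xhat = one_step_predictors([1.0, 2.0], theta)  # made-up observations X_1, X_2
print(theta[4][1:], v[4])  # approximately [-0.7870, 0, 0, 0] and 1.1017
```

Because the inner loop reuses entries of `row` that were filled in for smaller `k`, the order of computation matches the order stated in the algorithm: $\theta_{nn}$ first, then $\theta_{n,n-1}$, and so on down to $\theta_{n1}$, followed by $v_n$.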


Recursive Calculation of the h-Step Predictors

For $h$-step prediction we use the result

$$P_n \left( X_{n+k} - P_{n+k-1} X_{n+k} \right) = 0, \qquad k \ge 1. \qquad (2.5.29)$$

This follows from (2.5.10) and the fact that

$$E\left[ \left( X_{n+k} - P_{n+k-1} X_{n+k} - 0 \right) X_{n+1-j} \right] = 0, \qquad j = 1, \ldots, n.$$

Hence,

$$P_n X_{n+h} = P_n P_{n+h-1} X_{n+h} = P_n \hat X_{n+h} = P_n \left( \sum_{j=1}^{n+h-1} \theta_{n+h-1,j} \left( X_{n+h-j} - \hat X_{n+h-j} \right) \right).$$

Applying (2.5.29) again and using the linearity of $P_n$ we find that

$$P_n X_{n+h} = \sum_{j=h}^{n+h-1} \theta_{n+h-1,j} \left( X_{n+h-j} - \hat X_{n+h-j} \right), \qquad (2.5.30)$$

where the coefficients $\theta_{nj}$ are determined as before by the innovations algorithm.
Moreover, the mean squared error can be expressed as

$$E\left( X_{n+h} - P_n X_{n+h} \right)^2 = E X_{n+h}^2 - E\left( P_n X_{n+h} \right)^2 = \kappa(n+h, n+h) - \sum_{j=h}^{n+h-1} \theta_{n+h-1,j}^2\, v_{n+h-j-1}. \qquad (2.5.31)$$
2.5.3 Prediction of a Stationary Process in Terms of Infinitely Many Past Values

It is often useful, when many past observations $X_m, \ldots, X_0, X_1, \ldots, X_n$ ($m < 0$)
are available, to evaluate the best linear predictor of $X_{n+h}$ in terms of $1, X_m, \ldots, X_0,
\ldots, X_n$. This predictor, which we shall denote by $P_{m,n} X_{n+h}$, can easily be evaluated
by the methods described above. If $|m|$ is large, this predictor can be approximated
by the sometimes more easily calculated mean square limit

$$\tilde P_n X_{n+h} = \lim_{m \to -\infty} P_{m,n} X_{n+h}.$$

We shall refer to $\tilde P_n$ as the prediction operator based on the infinite past,
$\{X_t, -\infty < t \le n\}$. Analogously we shall refer to $P_n$ as the prediction operator based
on the finite past, $\{X_1, \ldots, X_n\}$. (Mean square convergence of random variables is
discussed in Appendix C.)


Determination of $\tilde P_n X_{n+h}$

Like $P_n X_{n+h}$, the best linear predictor $\tilde P_n X_{n+h}$ when $\{X_t\}$ is a zero-mean stationary
process with autocovariance function $\gamma(\cdot)$ is characterized by the equations

$$E\left[ \left( X_{n+h} - \tilde P_n X_{n+h} \right) X_{n+1-i} \right] = 0, \qquad i = 1, 2, \ldots.$$

If we can find a solution to these equations, it will necessarily be the uniquely defined
predictor $\tilde P_n X_{n+h}$. An approach to this problem that is often effective is to assume
that $\tilde P_n X_{n+h}$ can be expressed in the form

$$\tilde P_n X_{n+h} = \sum_{j=1}^{\infty} \alpha_j X_{n+1-j},$$

in which case the preceding equations reduce to

$$E\left[ \left( X_{n+h} - \sum_{j=1}^{\infty} \alpha_j X_{n+1-j} \right) X_{n+1-i} \right] = 0, \qquad i = 1, 2, \ldots,$$

or equivalently,

$$\sum_{j=1}^{\infty} \gamma(i - j)\, \alpha_j = \gamma(h + i - 1), \qquad i = 1, 2, \ldots.$$

This is an infinite set of linear equations for the unknown coefficients $\alpha_i$ that determine
$\tilde P_n X_{n+h}$, provided that the resulting series converges.


Properties of $\tilde P_n$:

Suppose that $EU^2 < \infty$, $EV^2 < \infty$, $a$, $b$, and $c$ are constants, and $\Gamma = \mathrm{Cov}(\mathbf{W}, \mathbf{W})$.
1. $E[(U - \tilde P_n(U)) X_j] = 0$, $j \le n$.
2. $\tilde P_n(aU + bV + c) = a \tilde P_n(U) + b \tilde P_n(V) + c$.
3. $\tilde P_n(U) = U$ if $U$ is a limit of linear combinations of $X_j$, $j \le n$.
4. $\tilde P_n(U) = EU$ if $\mathrm{Cov}(U, X_j) = 0$ for all $j \le n$.

These properties can sometimes be used to simplify the calculation of
$\tilde P_n X_{n+h}$, notably when the process $\{X_t\}$ is an ARMA process.


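Example 2.5.7 Consider the causal invertible ARMA(1,1) process $\{X_t\}$ defined by

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

We know from (2.3.3) and (2.3.5) that we have the representations

$$X_{n+1} = Z_{n+1} + (\phi + \theta) \sum_{j=1}^{\infty} \phi^{j-1} Z_{n+1-j}$$

and

$$Z_{n+1} = X_{n+1} - (\phi + \theta) \sum_{j=1}^{\infty} (-\theta)^{j-1} X_{n+1-j}.$$

Applying the operator $\tilde P_n$ to the second equation and using the properties of $\tilde P_n$ gives

$$\tilde P_n X_{n+1} = (\phi + \theta) \sum_{j=1}^{\infty} (-\theta)^{j-1} X_{n+1-j}.$$

Applying the operator $\tilde P_n$ to the first equation and using the properties of $\tilde P_n$ gives

$$\tilde P_n X_{n+1} = (\phi + \theta) \sum_{j=1}^{\infty} \phi^{j-1} Z_{n+1-j}.$$

Hence,

$$X_{n+1} - \tilde P_n X_{n+1} = Z_{n+1},$$

and so the mean squared error of the predictor $\tilde P_n X_{n+1}$ is $E Z_{n+1}^2 = \sigma^2$.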


2.6 The Wold Decomposition 


Consider the stationary process

$$X_t = A \cos(\omega t) + B \sin(\omega t),$$

where $\omega \in (0, \pi)$ is constant and $A$, $B$ are uncorrelated random variables with mean
0 and variance $\sigma^2$. Notice that

$$X_n = (2 \cos \omega) X_{n-1} - X_{n-2} = \tilde P_{n-1} X_n, \qquad n = 0, \pm 1, \ldots,$$

so that $X_n - \tilde P_{n-1} X_n = 0$ for all $n$. Processes with the latter property are said to be
deterministic.

The Wold Decomposition:

If $\{X_t\}$ is a nondeterministic stationary time series, then

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t, \qquad (2.6.1)$$

where
1. $\psi_0 = 1$ and $\sum_{j=0}^{\infty} \psi_j^2 < \infty$,
2. $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$,
3. $\mathrm{Cov}(Z_s, V_t) = 0$ for all $s$ and $t$,
4. $Z_t = \tilde P_t Z_t$ for all $t$,
5. $V_t = \tilde P_s V_t$ for all $s$ and $t$, and
6. $\{V_t\}$ is deterministic.

Here as in Section 2.5, $\tilde P_t Y$ denotes the best predictor of $Y$ in terms of linear
combinations, or limits of linear combinations, of $1, X_s$, $-\infty < s \le t$. The sequences
$\{Z_t\}$, $\{\psi_j\}$, and $\{V_t\}$ are unique and can be written explicitly as $Z_t = X_t - \tilde P_{t-1} X_t$,
$\psi_j = E(X_t Z_{t-j}) / E(Z_t^2)$, and $V_t = X_t - \sum_{j=0}^{\infty} \psi_j Z_{t-j}$. (See TSTM, p. 188.) For
most of the zero-mean stationary time series dealt with in this book (in particular for
all ARMA processes) the deterministic component $V_t$ is 0 for all $t$, and the series is
then said to be purely nondeterministic.

Example 2.6.1 If $X_t = U_t + Y$, where $\{U_t\} \sim \mathrm{WN}(0, \nu^2)$, $E(U_t Y) = 0$ for all $t$, and $Y$ has mean
0 and variance $\tau^2$, then $\tilde P_{t-1} X_t = Y$, since $Y$ is the mean square limit as $s \to \infty$ of
$[X_{t-1} + \cdots + X_{t-s}]/s$, and $E[(X_t - Y) X_s] = 0$ for all $s \le t - 1$. Hence the sequences
in the Wold decomposition of $\{X_t\}$ are given by $Z_t = U_t$, $\psi_0 = 1$, $\psi_j = 0$ for $j > 0$,
and $V_t = Y$.

Problems


2.1. Suppose that $X_1, X_2, \ldots$ is a stationary time series with mean $\mu$ and ACF $\rho(\cdot)$.
Show that the best predictor of $X_{n+h}$ of the form $aX_n + b$ is obtained by choosing
$a = \rho(h)$ and $b = \mu(1 - \rho(h))$.

2.2. Show that the process

$$X_t = A \cos(\omega t) + B \sin(\omega t), \qquad t = 0, \pm 1, \ldots$$

(where $A$ and $B$ are uncorrelated random variables with mean 0 and variance 1
and $\omega$ is a fixed frequency in the interval $[0, \pi]$), is stationary and find its mean
and autocovariance function. Deduce that the function $\kappa(h) = \cos(\omega h)$, $h = 0, \pm 1, \ldots$,
is nonnegative definite.

2.3. a. Find the ACVF of the time series $X_t = Z_t + .3 Z_{t-1} - .4 Z_{t-2}$, where
$\{Z_t\} \sim \mathrm{WN}(0, 1)$.

b. Find the ACVF of the time series $Y_t = \tilde Z_t - 1.2 \tilde Z_{t-1} - 1.6 \tilde Z_{t-2}$, where
$\{\tilde Z_t\} \sim \mathrm{WN}(0, .25)$. Compare with the answer found in (a).

2.4. It is clear that the function $\kappa(h) = 1$, $h = 0, \pm 1, \ldots$, is an autocovariance
function, since it is the autocovariance function of the process $X_t = Z$, $t = 0, \pm 1, \ldots$,
where $Z$ is a random variable with mean 0 and variance 1. By identifying appropriate
sequences of random variables, show that the following functions are also
autocovariance functions:

(a) $\kappa(h) = (-1)^{|h|}$

(b) $\kappa(h) = 1 + \cos\left(\dfrac{\pi h}{2}\right) + \cos\left(\dfrac{\pi h}{4}\right)$

(c) $\kappa(h) = \begin{cases} 1, & \text{if } h = 0, \\ 0.4, & \text{if } h = \pm 1, \\ 0, & \text{otherwise.} \end{cases}$

2.5. Suppose that $\{X_t, t = 0, \pm 1, \ldots\}$ is stationary and that $|\theta| < 1$. Show that for
each fixed $n$ the sequence

$$S_m = \sum_{j=1}^{m} \theta^j X_{n-j}$$

is convergent absolutely and in mean square (see Appendix C) as $m \to \infty$.

2.6. Verify equations (2.2.6).

2.7. Show, using the geometric series $1/(1 - x) = \sum_{j=0}^{\infty} x^j$ for $|x| < 1$, that
$1/(1 - \phi z) = -\sum_{j=1}^{\infty} \phi^{-j} z^{-j}$ for $|\phi| > 1$ and $|z| \ge 1$.

2.8. Show that the autoregressive equations

$$X_t = \phi_1 X_{t-1} + Z_t, \qquad t = 0, \pm 1, \ldots,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi_1| = 1$, have no stationary solution. HINT:
Suppose there does exist a stationary solution $\{X_t\}$ and use the autoregressive
equation to derive an expression for the variance of $X_t - \phi_1^{n+1} X_{t-n-1}$ that
contradicts the stationarity assumption.

2.9. Let $\{Y_t\}$ be the AR(1) plus noise time series defined by

$$Y_t = X_t + W_t,$$

where $\{W_t\} \sim \mathrm{WN}(0, \sigma_w^2)$, $\{X_t\}$ is the AR(1) process of Example 2.2.1, i.e.,

$$X_t - \phi X_{t-1} = Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma_z^2),$$

and $E(W_s Z_t) = 0$ for all $s$ and $t$.

a. Show that $\{Y_t\}$ is stationary and find its autocovariance function.

b. Show that the time series $U_t := Y_t - \phi Y_{t-1}$ is 1-correlated and hence, by
Proposition 2.1.1, is an MA(1) process.

c. Conclude from (b) that $\{Y_t\}$ is an ARMA(1,1) process and express the three
parameters of this model in terms of $\phi$, $\sigma_w^2$, and $\sigma_z^2$.

2.10. Use the program ITSM to compute the coefficients $\psi_j$ and $\pi_j$, $j = 1, \ldots, 5$, in
the expansions

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$$

and

$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}$$

for the ARMA(1,1) process defined by the equations

$$X_t - 0.5 X_{t-1} = Z_t + 0.5 Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

(Select File>Project>New>Univariate, then Model>Specify. In the resulting
dialog box enter 1 for the AR and MA orders, specify $\phi(1) = \theta(1) = 0.5$,
and click OK. Finally, select Model>AR/MA Infinity>Default lag and the
values of $\psi_j$ and $\pi_j$ will appear on the screen.) Check the results with those
obtained in Section 2.3.

2.11. Suppose that in a sample of size 100 from an AR(1) process with mean $\mu$,
$\phi = .6$, and $\sigma^2 = 2$ we obtain $\bar x_{100} = .271$. Construct an approximate 95%
confidence interval for $\mu$. Are the data compatible with the hypothesis that
$\mu = 0$?

2.12. Suppose that in a sample of size 100 from an MA(1) process with mean $\mu$,
$\theta = -.6$, and $\sigma^2 = 1$ we obtain $\bar x_{100} = .157$. Construct an approximate 95%
confidence interval for $\mu$. Are the data compatible with the hypothesis that
$\mu = 0$?

2.13. Suppose that in a sample of size 100, we obtain $\hat\rho(1) = .438$ and $\hat\rho(2) = .145$.

a. Assuming that the data were generated from an AR(1) model, construct
approximate 95% confidence intervals for both $\rho(1)$ and $\rho(2)$. Based on
these two confidence intervals, are the data consistent with an AR(1) model
with $\phi = .8$?

b. Assuming that the data were generated from an MA(1) model, construct
approximate 95% confidence intervals for both $\rho(1)$ and $\rho(2)$. Based on
these two confidence intervals, are the data consistent with an MA(1) model
with $\theta = .6$?

2.14. Let $\{X_t\}$ be the process defined in Problem 2.2.

a. Find $P_1 X_2$ and its mean squared error.

b. Find $P_2 X_3$ and its mean squared error.

c. Find $\tilde P_n X_{n+1}$ and its mean squared error.

2.15. Suppose that $\{X_t, t = 0, \pm 1, \ldots\}$ is a stationary process satisfying the equations

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. Show
that the best linear predictor $P_n X_{n+1}$ of $X_{n+1}$ in terms of $1, X_1, \ldots, X_n$, assuming
$n > p$, is

$$P_n X_{n+1} = \phi_1 X_n + \cdots + \phi_p X_{n+1-p}.$$

What is the mean squared error of $P_n X_{n+1}$?

2.16. Use the program ITSM to plot the sample ACF and PACF up to lag 40 of the
sunspot series $D_t$, $t = 1, \ldots, 100$, contained in the ITSM file SUNSPOTS.TSM.
(Open the project SUNSPOTS.TSM and click on the second yellow button at
the top of the screen to see the graphs. Repeated clicking on this button will
toggle between graphs of the sample ACF, sample PACF, and both. To see the
numerical values, right-click on the graph and select Info.) Fit an AR(2) model
to the mean-corrected data by selecting Model>Estimation>Preliminary
and click Yes to subtract the sample mean from the data. In the dialog box that
follows, enter 2 for the AR order and make sure that the MA order is zero and
that the Yule-Walker algorithm is selected without AICC minimization. Click
OK and you will obtain a model of the form

$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + Z_t, \qquad \text{where } \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

for the mean-corrected series $X_t = D_t - 46.93$. Record the values of the
estimated parameters $\phi_1$, $\phi_2$, and $\sigma^2$. Compare the model and sample ACF and
PACF by selecting the third yellow button at the top of the screen. Print the
graphs by right-clicking and selecting Print.

2.17. Without exiting from ITSM, use the model found in the preceding problem to
compute forecasts of the next ten values of the sunspot series. (Select
Forecasting>ARMA, make sure that the number of forecasts is set to 10 and the box
Add the mean to the forecasts is checked, and then click OK. You will
see a graph of the original data with the ten forecasts appended. Right-click on
the graph and then on Info to get the numerical values of the forecasts. Print
the graph as described in Problem 2.16.) The details of the calculations will be
taken up in Chapter 3 when we discuss ARMA models in detail.

2.18. Let $\{X_t\}$ be the stationary process defined by the equations

$$X_t = Z_t - \theta Z_{t-1}, \qquad t = 0, \pm 1, \ldots,$$

where $|\theta| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Show that the best linear predictor
$\tilde P_n X_{n+1}$ of $X_{n+1}$ based on $\{X_j, -\infty < j \le n\}$ is

$$\tilde P_n X_{n+1} = -\sum_{j=1}^{\infty} \theta^j X_{n+1-j}.$$

What is the mean squared error of the predictor $\tilde P_n X_{n+1}$?

2.19. If $\{X_t\}$ is defined as in Problem 2.18 and $\theta = 1$, find the best linear predictor
$P_n X_{n+1}$ of $X_{n+1}$ in terms of $X_1, \ldots, X_n$. What is the corresponding mean squared
error?

2.20. In the innovations algorithm, show that for each $n \ge 2$, the innovation $X_n - \hat X_n$
is uncorrelated with $X_1, \ldots, X_{n-1}$. Conclude that $X_n - \hat X_n$ is uncorrelated with
the innovations $X_1 - \hat X_1, \ldots, X_{n-1} - \hat X_{n-1}$.

2.21. Let $X_1, X_2, X_4, X_5$ be observations from the MA(1) model

$$X_t = Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

a. Find the best linear estimate of the missing value $X_3$ in terms of $X_1$ and $X_2$.

b. Find the best linear estimate of the missing value $X_3$ in terms of $X_4$ and $X_5$.

c. Find the best linear estimate of the missing value $X_3$ in terms of $X_1$, $X_2$, $X_4$,
and $X_5$.

d. Compute the mean squared errors for each of the estimates in (a), (b), and (c).

2.22. Repeat parts (a)-(d) of Problem 2.21 assuming now that the observations $X_1$,
$X_2$, $X_4$, $X_5$ are from the causal AR(1) model

$$X_t = \phi X_{t-1} + Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$


ARMA Models 


3.1 ARMA(p, q) Processes 
3.2. The ACF and PACF of an ARMA(p, q) Process 
3.3 Forecasting ARMA Processes 


In this chapter we introduce an important parametric family of stationary time series,
the autoregressive moving-average, or ARMA, processes. For a large class of
autocovariance functions $\gamma(\cdot)$ it is possible to find an ARMA process $\{X_t\}$ with ACVF $\gamma_X(\cdot)$
such that $\gamma(\cdot)$ is well approximated by $\gamma_X(\cdot)$. In particular, for any positive integer
$K$, there exists an ARMA process $\{X_t\}$ such that $\gamma_X(h) = \gamma(h)$ for $h = 0, 1, \ldots, K$.
For this (and other) reasons, the family of ARMA processes plays a key role in the
modeling of time series data. The linear structure of ARMA processes also leads
to a substantial simplification of the general methods for linear prediction discussed
earlier in Section 2.5.


3.1 ARMA(p, q) Processes

In Section 2.3 we introduced an ARMA(1,1) process and discussed some of its key
properties. These included existence and uniqueness of stationary solutions of the
defining equations and the concepts of causality and invertibility. In this section we
extend these notions to the general ARMA($p$, $q$) process.

Definition 3.1.1

$\{X_t\}$ is an ARMA($p$, $q$) process if $\{X_t\}$ is stationary and if for every $t$,

$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \qquad (3.1.1)$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and the polynomials $(1 - \phi_1 z - \cdots - \phi_p z^p)$ and
$(1 + \theta_1 z + \cdots + \theta_q z^q)$ have no common factors.

The process $\{X_t\}$ is said to be an ARMA($p$, $q$) process with mean $\mu$ if $\{X_t - \mu\}$
is an ARMA($p$, $q$) process.

It is convenient to use the more concise form of (3.1.1)

$$\phi(B) X_t = \theta(B) Z_t, \qquad (3.1.2)$$

where $\phi(\cdot)$ and $\theta(\cdot)$ are the $p$th and $q$th-degree polynomials

$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$$

and

$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q,$$

and $B$ is the backward shift operator ($B^j X_t = X_{t-j}$, $B^j Z_t = Z_{t-j}$, $j = 0, \pm 1, \ldots$).
The time series $\{X_t\}$ is said to be an autoregressive process of order $p$ (or AR($p$))
if $\theta(z) \equiv 1$, and a moving-average process of order $q$ (or MA($q$)) if $\phi(z) \equiv 1$.

An important part of Definition 3.1.1 is the requirement that $\{X_t\}$ be stationary.
In Section 2.3 we showed, for the ARMA(1,1) equations (2.3.1), that a stationary
solution exists (and is unique) if and only if $\phi_1 \ne \pm 1$. The latter is equivalent to the
condition that the autoregressive polynomial $\phi(z) = 1 - \phi_1 z \ne 0$ for $z = \pm 1$. The
analogous condition for the general ARMA($p$, $q$) process is $\phi(z) = 1 - \phi_1 z - \cdots -
\phi_p z^p \ne 0$ for all complex $z$ with $|z| = 1$. (Complex $z$ is used here, since the zeros of
a polynomial of degree $p > 1$ may be either real or complex. The region defined by
the set of complex $z$ such that $|z| = 1$ is referred to as the unit circle.) If $\phi(z) \ne 0$ for
all $z$ on the unit circle, then there exists $\delta > 0$ such that

$$\frac{1}{\phi(z)} = \sum_{j=-\infty}^{\infty} \chi_j z^j \qquad \text{for } 1 - \delta < |z| < 1 + \delta,$$

and $\sum_{j=-\infty}^{\infty} |\chi_j| < \infty$. We can then define $1/\phi(B)$ as the linear filter with absolutely
summable coefficients

$$\frac{1}{\phi(B)} = \sum_{j=-\infty}^{\infty} \chi_j B^j.$$

Applying the operator $\chi(B) := 1/\phi(B)$ to both sides of (3.1.2), we obtain

$$X_t = \chi(B)\phi(B) X_t = \chi(B)\theta(B) Z_t = \psi(B) Z_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}, \qquad (3.1.3)$$

where $\psi(z) = \chi(z)\theta(z) = \sum_{j=-\infty}^{\infty} \psi_j z^j$. Using the argument given in Section 2.3
for the ARMA(1,1) process, it follows that $\psi(B) Z_t$ is the unique stationary solution
of (3.1.1).

Existence and Uniqueness:

A stationary solution $\{X_t\}$ of equations (3.1.1) exists (and is also the unique
stationary solution) if and only if

$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \ne 0 \qquad \text{for all } |z| = 1. \qquad (3.1.4)$$

In Section 2.3 we saw that the ARMA(1,1) process is causal, i.e., that $X_t$ can be
expressed in terms of $Z_s$, $s \le t$, if and only if $|\phi_1| < 1$. For a general ARMA($p$, $q$)
process the analogous condition is that $\phi(z) \ne 0$ for $|z| \le 1$, i.e., the zeros of the
autoregressive polynomial must all be greater than 1 in absolute value.


Causality:

An ARMA($p$, $q$) process $\{X_t\}$ is causal, or a causal function of $\{Z_t\}$, if there
exist constants $\{\psi_j\}$ such that $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} \qquad \text{for all } t. \qquad (3.1.5)$$

Causality is equivalent to the condition

$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \ne 0 \qquad \text{for all } |z| \le 1. \qquad (3.1.6)$$

The proof of the equivalence between causality and (3.1.6) follows from
elementary properties of power series. From (3.1.3) we see that $\{X_t\}$ is causal if and
only if $\chi(z) := 1/\phi(z) = \sum_{j=0}^{\infty} \chi_j z^j$ (assuming that $\phi(z)$ and $\theta(z)$ have no common
factors). But this, in turn, is equivalent to (3.1.6).

The sequence $\{\psi_j\}$ in (3.1.5) is determined by the relation $\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j =
\theta(z)/\phi(z)$, or equivalently by the identity

$$(1 - \phi_1 z - \cdots - \phi_p z^p)(\psi_0 + \psi_1 z + \cdots) = 1 + \theta_1 z + \cdots + \theta_q z^q.$$

Equating coefficients of $z^j$, $j = 0, 1, \ldots$, we find that

$$1 = \psi_0,$$
$$\theta_1 = \psi_1 - \psi_0 \phi_1,$$
$$\theta_2 = \psi_2 - \psi_1 \phi_1 - \psi_0 \phi_2,$$
$$\vdots$$

or equivalently,

$$\psi_j - \sum_{k=1}^{p} \phi_k \psi_{j-k} = \theta_j, \qquad j = 0, 1, \ldots, \qquad (3.1.7)$$

where $\theta_0 := 1$, $\theta_j := 0$ for $j > q$, and $\psi_j := 0$ for $j < 0$.

Invertibility, which allows $Z_t$ to be expressed in terms of $X_s$, $s \le t$, has a similar
characterization in terms of the moving-average polynomial.


Invertibility:

An ARMA($p$, $q$) process $\{X_t\}$ is invertible if there exist constants $\{\pi_j\}$ such that
$\sum_{j=0}^{\infty} |\pi_j| < \infty$ and

$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j} \qquad \text{for all } t.$$

Invertibility is equivalent to the condition

$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \ne 0 \qquad \text{for all } |z| \le 1.$$

Interchanging the roles of the AR and MA polynomials, we find from (3.1.7) that
the sequence $\{\pi_j\}$ is determined by the equations

$$\pi_j + \sum_{k=1}^{q} \theta_k \pi_{j-k} = -\phi_j, \qquad j = 0, 1, \ldots, \qquad (3.1.8)$$

where $\phi_0 := -1$, $\phi_j := 0$ for $j > p$, and $\pi_j := 0$ for $j < 0$.


Example 3.1.1 An ARMA(1,1) process

Consider the ARMA(1,1) process $\{X_t\}$ satisfying the equations

$$X_t - .5 X_{t-1} = Z_t + .4 Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \qquad (3.1.9)$$

Since the autoregressive polynomial $\phi(z) = 1 - .5z$ has a zero at $z = 2$, which is
located outside the unit circle, we conclude from (3.1.4) and (3.1.6) that there exists
a unique ARMA process satisfying (3.1.9) that is also causal. The coefficients $\{\psi_j\}$
in the MA($\infty$) representation of $\{X_t\}$ are found directly from (3.1.7):

$$\psi_0 = 1,$$
$$\psi_1 = .4 + .5,$$
$$\psi_2 = .5(.4 + .5),$$
$$\psi_j = .5^{j-1}(.4 + .5), \qquad j = 1, 2, \ldots.$$

The MA polynomial $\theta(z) = 1 + .4z$ has a zero at $z = -1/.4 = -2.5$, which is also
located outside the unit circle. This implies that $\{X_t\}$ is invertible with coefficients
$\{\pi_j\}$ given by (see (3.1.8))

$$\pi_0 = 1,$$
$$\pi_1 = -(.4 + .5),$$
$$\pi_2 = -(.4 + .5)(-.4),$$
$$\pi_j = -(.4 + .5)(-.4)^{j-1}, \qquad j = 1, 2, \ldots.$$

(A direct derivation of these formulas for $\{\psi_j\}$ and $\{\pi_j\}$ was given in Section 2.3
without appealing to the recursions (3.1.7) and (3.1.8).)
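
The recursions (3.1.7) and (3.1.8) are easy to run numerically. The Python sketch below is illustrative only: the function names `psi_weights` and `pi_weights` and the convention of passing the coefficients as lists `phi = [phi_1, ..., phi_p]` and `theta = [theta_1, ..., theta_q]` are assumptions made for this example, not the interface of ITSM. It is applied to the model (3.1.9).

```python
def psi_weights(phi, theta, n):
    """psi_0..psi_n from (3.1.7): psi_j = theta_j + sum_{k=1}^p phi_k psi_{j-k}."""
    p, q = len(phi), len(theta)
    psi = []
    for j in range(n + 1):
        # theta_0 := 1 and theta_j := 0 for j > q
        th_j = 1.0 if j == 0 else (theta[j - 1] if j <= q else 0.0)
        # psi_{j-k} := 0 for j - k < 0, so k runs only up to min(j, p)
        psi.append(th_j + sum(phi[k - 1] * psi[j - k]
                              for k in range(1, min(j, p) + 1)))
    return psi


def pi_weights(phi, theta, n):
    """pi_0..pi_n from (3.1.8): pi_j = -phi_j - sum_{k=1}^q theta_k pi_{j-k}."""
    p, q = len(phi), len(theta)
    pi = []
    for j in range(n + 1):
        # phi_0 := -1 and phi_j := 0 for j > p
        ph_j = -1.0 if j == 0 else (phi[j - 1] if j <= p else 0.0)
        pi.append(-ph_j - sum(theta[k - 1] * pi[j - k]
                              for k in range(1, min(j, q) + 1)))
    return pi


# Model (3.1.9): X_t - .5 X_{t-1} = Z_t + .4 Z_{t-1}
psi = psi_weights([0.5], [0.4], 3)  # approximately [1, 0.9, 0.45, 0.225]
pi = pi_weights([0.5], [0.4], 3)    # approximately [1, -0.9, 0.36, -0.144]
print(psi, pi)
```

The recursions themselves do not test causality or invertibility; for a noncausal or noninvertible model they still return numbers, but the resulting sequences are not the absolutely summable coefficients required by (3.1.5) and its invertible counterpart.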


Example 3.1.2 An AR(2) process

Let $\{X_t\}$ be the AR(2) process

$$X_t = .7 X_{t-1} - .1 X_{t-2} + Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

The autoregressive polynomial for this process has the factorization $\phi(z) = 1 - .7z +
.1z^2 = (1 - .5z)(1 - .2z)$, and is therefore zero at $z = 2$ and $z = 5$. Since these
zeros lie outside the unit circle, we conclude that $\{X_t\}$ is a causal AR(2) process with
coefficients $\{\psi_j\}$ given by

$$\psi_0 = 1,$$
$$\psi_1 = .7,$$
$$\psi_2 = .7^2 - .1,$$
$$\psi_j = .7 \psi_{j-1} - .1 \psi_{j-2}, \qquad j = 2, 3, \ldots.$$

While it is a simple matter to calculate $\psi_j$ numerically for any $j$, it is possible also
to give an explicit solution of these difference equations using the theory of linear
difference equations (see TSTM, Section 3.6).

The option Model>Specify of the program ITSM allows the entry of any causal
ARMA($p$, $q$) model with $p < 28$ and $q < 28$. This option contains a causality check
and will immediately let you know if the entered model is noncausal. (A causal model
can be obtained by setting all the AR coefficients equal to .001.) Once a causal model
has been entered, the coefficients $\psi_j$ in the MA($\infty$) representation of the process can
be computed by selecting Model>AR/MA Infinity. This option will also compute
the AR($\infty$) coefficients $\pi_j$, provided that the model is invertible.


Example 3.1.3 An ARMA(2,1) process

Consider the ARMA(2,1) process defined by the equations

$$X_t - .75 X_{t-1} + .5625 X_{t-2} = Z_t + 1.25 Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

The AR polynomial $\phi(z) = 1 - .75z + .5625z^2$ has zeros at $z = 2(1 \pm i\sqrt{3})/3$,
which lie outside the unit circle. The process is therefore causal. On the other hand,
the MA polynomial $\theta(z) = 1 + 1.25z$ has a zero at $z = -.8$, and hence $\{X_t\}$ is not
invertible.
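
The zero locations quoted in this example can be checked numerically. The following Python sketch is deliberately limited to polynomials of degree 1 or 2 of the form $1 + c_1 z + c_2 z^2$, so that the quadratic formula from the standard `cmath` module suffices; the helper names `poly_zeros` and `all_outside_unit_circle` are hypothetical and introduced only for illustration. Higher-degree polynomials would need a general root finder, which is not shown here.

```python
import cmath


def poly_zeros(c1, c2=0.0):
    """Zeros of 1 + c1*z + c2*z**2 (degree 1 or 2 only)."""
    if c2 == 0.0:
        return [-1.0 / c1]
    disc = cmath.sqrt(c1 * c1 - 4.0 * c2)
    return [(-c1 + disc) / (2.0 * c2), (-c1 - disc) / (2.0 * c2)]


def all_outside_unit_circle(zeros):
    """True if every zero has modulus strictly greater than 1."""
    return all(abs(z) > 1.0 for z in zeros)


# AR polynomial 1 - .75 z + .5625 z^2 and MA polynomial 1 + 1.25 z
ar_zeros = poly_zeros(-0.75, 0.5625)
ma_zeros = poly_zeros(1.25)

causal = all_outside_unit_circle(ar_zeros)      # condition (3.1.6)
invertible = all_outside_unit_circle(ma_zeros)  # invertibility condition
print(ar_zeros, causal)      # zeros near 0.6667 +/- 1.1547i, modulus 4/3
print(ma_zeros, invertible)  # zero at -0.8, inside the unit circle
```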


Remark 1. It should be noted that causality and invertibility are properties not of
$\{X_t\}$ alone, but rather of the relationship between the two processes $\{X_t\}$ and $\{Z_t\}$
appearing in the defining ARMA equations (3.1.1).

Remark 2. If $\{X_t\}$ is an ARMA process defined by $\phi(B) X_t = \theta(B) Z_t$, where
$\theta(z) \ne 0$ if $|z| = 1$, then it is always possible (see TSTM, p. 127) to find polynomials
$\tilde\phi(z)$ and $\tilde\theta(z)$ and a white noise sequence $\{W_t\}$ such that $\tilde\phi(B) X_t = \tilde\theta(B) W_t$ and $\tilde\theta(z)$
and $\tilde\phi(z)$ are nonzero for $|z| \le 1$. However, if the original white noise sequence $\{Z_t\}$
is iid, then the new white noise sequence will not be iid unless $\{Z_t\}$ is Gaussian.

In view of the preceding remark, we will focus our attention principally on causal
and invertible ARMA processes.


3.2 The ACF and PACF of an ARMA(p, q) Process 


In this section we discuss three methods for computing the autocovariance function
$\gamma(\cdot)$ of a causal ARMA process $\{X_t\}$. The autocorrelation function is readily found
from the ACVF on dividing by $\gamma(0)$. The partial autocorrelation function (PACF) is
also found from the function $\gamma(\cdot)$.

3.2.1 Calculation of the ACVF

First we determine the ACVF $\gamma(\cdot)$ of the causal ARMA($p$, $q$) process defined by

$$\phi(B) X_t = \theta(B) Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \qquad (3.2.1)$$

where $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$. The causality
assumption implies that

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \qquad (3.2.2)$$

where $\sum_{j=0}^{\infty} \psi_j z^j = \theta(z)/\phi(z)$, $|z| \le 1$. The calculation of the sequence $\{\psi_j\}$ was
discussed in Section 3.1.
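First Method. From Proposition 2.2.1 and the representation (3.2.2), we obtain

$$\gamma(h) = E(X_{t+h} X_t) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+|h|}. \qquad (3.2.3)$$

Example 3.2.1 The ARMA(1,1) process

Substituting from (2.3.3) into (3.2.3), we find that the ACVF of the process defined
by

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \qquad (3.2.4)$$

with $|\phi| < 1$ is given by

$$\gamma(0) = \sigma^2 \sum_{j=0}^{\infty} \psi_j^2 = \sigma^2 \left[ 1 + (\theta + \phi)^2 \sum_{j=0}^{\infty} \phi^{2j} \right] = \sigma^2 \left[ 1 + \frac{(\theta + \phi)^2}{1 - \phi^2} \right],$$

$$\gamma(1) = \sigma^2 \sum_{j=0}^{\infty} \psi_{j+1} \psi_j = \sigma^2 \left[ \theta + \phi + (\theta + \phi)^2 \phi \sum_{j=0}^{\infty} \phi^{2j} \right] = \sigma^2 \left[ \theta + \phi + \frac{(\theta + \phi)^2 \phi}{1 - \phi^2} \right],$$

and

$$\gamma(h) = \phi^{h-1} \gamma(1), \qquad h \ge 2.$$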
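
As a numerical check of the first method, the sum (3.2.3) can be truncated and compared with the closed-form expressions above. The Python below is a sketch under stated assumptions: the function names are hypothetical, the $\psi$-weights are taken as $\psi_0 = 1$ and $\psi_j = (\phi + \theta)\phi^{j-1}$ for $j \ge 1$, and the parameter values $\phi = .5$, $\theta = .4$, $\sigma^2 = 1$ and the truncation point of 200 terms are arbitrary illustrative choices.

```python
def acvf_from_psi(psi, sigma2, h):
    """Truncated version of (3.2.3) using the available psi-weights."""
    h = abs(h)
    return sigma2 * sum(psi[j] * psi[j + h] for j in range(len(psi) - h))


def arma11_psi(phi, theta, n):
    """psi_0..psi_n for a causal ARMA(1,1): psi_j = (phi + theta) phi^(j-1)."""
    return [1.0] + [(phi + theta) * phi ** (j - 1) for j in range(1, n + 1)]


def arma11_acvf(phi, theta, sigma2, h):
    """Closed-form ACVF of the causal ARMA(1,1) process (3.2.4)."""
    h = abs(h)
    g0 = sigma2 * (1.0 + (theta + phi) ** 2 / (1.0 - phi ** 2))
    g1 = sigma2 * (theta + phi + (theta + phi) ** 2 * phi / (1.0 - phi ** 2))
    return g0 if h == 0 else phi ** (h - 1) * g1


phi_, theta_, sigma2_ = 0.5, 0.4, 1.0
psi = arma11_psi(phi_, theta_, 200)
series = [acvf_from_psi(psi, sigma2_, h) for h in range(4)]
closed = [arma11_acvf(phi_, theta_, sigma2_, h) for h in range(4)]
print(series)  # both lists are approximately [2.08, 1.44, 0.72, 0.36]
print(closed)
```

Truncation is harmless here because the weights decay geometrically; for roots of $\phi(z)$ close to the unit circle many more terms would be needed, which is one reason the third method is preferred for numerical work.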
Example 3.2.2 The MA(q) process

For the process

$$X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

equation (3.2.3) immediately gives the result

$$\gamma(h) = \begin{cases} \sigma^2 \displaystyle\sum_{j=0}^{q-|h|} \theta_j \theta_{j+|h|}, & \text{if } |h| \le q, \\ 0, & \text{if } |h| > q, \end{cases}$$

where $\theta_0$ is defined to be 1. The ACVF of the MA($q$) process thus has the distinctive
feature of vanishing at lags greater than $q$. Data for which the sample ACVF is
small for lags greater than $q$ therefore suggest that an appropriate model might be a
moving average of order $q$ (or less). Recall from Proposition 2.1.1 that every
zero-mean stationary process with correlations vanishing at lags greater than $q$ can be
represented as a moving-average process of order $q$ or less.
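
The MA($q$) formula is a finite sum and can be coded directly. In the Python sketch below, the function name `ma_acvf` and the list convention `theta = [theta_1, ..., theta_q]` are assumptions made for illustration; the MA(2) coefficients in the second call are invented for the example.

```python
def ma_acvf(theta, sigma2, h):
    """ACVF of X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q} at lag h."""
    coeffs = [1.0] + list(theta)   # theta_0 := 1
    q = len(theta)
    h = abs(h)
    if h > q:
        return 0.0                 # the ACVF vanishes beyond lag q
    return sigma2 * sum(coeffs[j] * coeffs[j + h] for j in range(q - h + 1))


# MA(1) with theta = -0.9 and sigma^2 = 1: gamma(0) = 1.81, gamma(1) = -0.9
ma1 = [ma_acvf([-0.9], 1.0, h) for h in range(3)]

# A made-up MA(2) with theta_1 = 0.5, theta_2 = 0.2 and sigma^2 = 1
ma2 = [ma_acvf([0.5, 0.2], 1.0, h) for h in range(4)]
print(ma1)  # approximately [1.81, -0.9, 0.0]
print(ma2)  # approximately [1.29, 0.6, 0.2, 0.0]
```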


Second Method. If we multiply each side of the equations

$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}$$

by $X_{t-k}$, $k = 0, 1, 2, \ldots$, and take expectations on each side, we find that

$$\gamma(k) - \phi_1 \gamma(k-1) - \cdots - \phi_p \gamma(k-p) = \sigma^2 \sum_{j=0}^{\infty} \theta_{k+j} \psi_j, \qquad 0 \le k < m, \qquad (3.2.5)$$

and

$$\gamma(k) - \phi_1 \gamma(k-1) - \cdots - \phi_p \gamma(k-p) = 0, \qquad k \ge m, \qquad (3.2.6)$$

where $m = \max(p, q+1)$, $\psi_j := 0$ for $j < 0$, $\theta_0 := 1$, and $\theta_j := 0$ for $j \notin \{0, \ldots, q\}$.
In calculating the right-hand side of (3.2.5) we have made use of the expansion (3.2.2).
Equations (3.2.6) are a set of homogeneous linear difference equations with constant
coefficients, for which the solution is well known (see, e.g., TSTM, Section 3.6) to
be of the form

$$\gamma(h) = \alpha_1 \xi_1^{-h} + \alpha_2 \xi_2^{-h} + \cdots + \alpha_p \xi_p^{-h}, \qquad h \ge m - p, \qquad (3.2.7)$$

where $\xi_1, \ldots, \xi_p$ are the roots (assumed to be distinct) of the equation $\phi(z) = 0$, and
$\alpha_1, \ldots, \alpha_p$ are arbitrary constants. (For further details, and for the treatment of the case
where the roots are not distinct, see TSTM, Section 3.6.) Of course, we are looking for
the solution of (3.2.6) that also satisfies (3.2.5). We therefore substitute the solution
(3.2.7) into (3.2.5) to obtain a set of $m$ linear equations that then uniquely determine
the constants $\alpha_1, \ldots, \alpha_p$ and the $m - p$ autocovariances $\gamma(h)$, $0 \le h < m - p$.

Example 3.2.3 The ARMA(1,1) process

For the causal ARMA(1,1) process defined in Example 3.2.1, equations (3.2.5) are

$$\gamma(0) - \phi \gamma(1) = \sigma^2 (1 + \theta(\theta + \phi)) \qquad (3.2.8)$$

and

$$\gamma(1) - \phi \gamma(0) = \sigma^2 \theta. \qquad (3.2.9)$$

Equation (3.2.6) takes the form

$$\gamma(k) - \phi \gamma(k-1) = 0, \qquad k \ge 2. \qquad (3.2.10)$$

The solution of (3.2.10) is

$$\gamma(h) = \alpha \phi^h, \qquad h \ge 1.$$

Substituting this expression for $\gamma(h)$ into the two preceding equations (3.2.8) and
(3.2.9) gives two linear equations for $\alpha$ and the unknown autocovariance $\gamma(0)$. These
equations are easily solved, giving the autocovariances already found for this process
in Example 3.2.1.

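Example 3.2.4 The general AR(2) process
For the causal AR(2) process defined by
\[
\bigl(1 - \xi_1^{-1} B\bigr)\bigl(1 - \xi_2^{-1} B\bigr) X_t = Z_t, \qquad |\xi_1|, |\xi_2| > 1, \ \xi_1 \ne \xi_2,
\]
we easily find from (3.2.7) and (3.2.5) using the relations
\[
\phi_1 = \xi_1^{-1} + \xi_2^{-1}
\]
and
\[
\phi_2 = -\xi_1^{-1} \xi_2^{-1}
\]
that
\[
\gamma(h) = \frac{\sigma^2 \xi_1^2 \xi_2^2}{(\xi_1 \xi_2 - 1)(\xi_2 - \xi_1)}
\Bigl[ (\xi_1^2 - 1)^{-1} \xi_1^{1-h} - (\xi_2^2 - 1)^{-1} \xi_2^{1-h} \Bigr]. \tag{3.2.11}
\]
Figures 3.1-3.4 illustrate some of the possible forms of $\gamma(\cdot)$ for different values of $\xi_1$
and $\xi_2$. Notice that in the case of complex conjugate roots $\xi_1 = re^{i\theta}$ and $\xi_2 = re^{-i\theta}$,
$0 < \theta < \pi$, we can write (3.2.11) in the more illuminating form
\[
\gamma(h) = \frac{\sigma^2 r^4 \cdot r^{-h} \sin(h\theta + \psi)}{(r^2 - 1)\,(r^4 - 2r^2 \cos 2\theta + 1)^{1/2} \sin\theta}, \tag{3.2.12}
\]
where
\[
\tan\psi = \frac{r^2 + 1}{r^2 - 1} \tan\theta \tag{3.2.13}
\]
and $\cos\psi$ has the same sign as $\cos\theta$. Thus in this case $\gamma(\cdot)$ has the form of a damped
sinusoidal function with damping factor $r^{-1}$ and period $2\pi/\theta$. If the roots are close
to the unit circle, then $r$ is close to 1, the damping is slow, and we obtain a nearly
sinusoidal autocovariance function.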
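
As a rough numerical companion to (3.2.11), the sketch below evaluates the AR(2) autocovariance from the two roots of $\phi(z) = 0$. The function name `ar2_acvf` is hypothetical, and the complex roots in the usage lines are made-up values; for a complex conjugate pair the imaginary part of the result is only rounding noise and is discarded.

```python
import cmath

def ar2_acvf(xi1, xi2, sigma2, h):
    """Evaluate (3.2.11) for distinct roots xi1, xi2 lying outside the unit circle."""
    h = abs(h)
    pref = sigma2 * xi1**2 * xi2**2 / ((xi1 * xi2 - 1) * (xi2 - xi1))
    val = pref * (xi1 ** (1 - h) / (xi1**2 - 1) - xi2 ** (1 - h) / (xi2**2 - 1))
    return complex(val).real   # real for real roots or a conjugate pair

# Real roots 2 and 5 (the case shown in Figure 3-1), sigma^2 = 1.
print([round(ar2_acvf(2.0, 5.0, 1.0, h), 4) for h in range(5)])

# Hypothetical complex pair r = 1.2, theta = 1.0: a damped sinusoid.
z = cmath.rect(1.2, 1.0)
print([round(ar2_acvf(z, z.conjugate(), 1.0, h), 4) for h in range(8)])
```

One quick consistency check is that the values satisfy $\gamma(k) = \phi_1\gamma(k-1) + \phi_2\gamma(k-2)$ for $k \ge 2$, with $\phi_1$ and $\phi_2$ obtained from the roots as in the example.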
Third Method. The autocovariances can also be found by solving the first $p + 1$
equations of (3.2.5) and (3.2.6) for $\gamma(0), \ldots, \gamma(p)$ and then using the subsequent
equations to solve successively for $\gamma(p+1), \gamma(p+2), \ldots$. This is an especially
convenient method for numerical determination of the autocovariances $\gamma(h)$ and is
used in the option Model>ACF/PACF>Model of the program ITSM.
Example 3.2.5 Consider again the causal ARMA(1,1) process of Example 3.2.1. To apply the third
method we simply solve (3.2.8) and (3.2.9) for $\gamma(0)$ and $\gamma(1)$. Then $\gamma(2), \gamma(3), \ldots$
can be found successively from (3.2.10). It is easy to check that this procedure gives
the same results as those obtained in Examples 3.2.1 and 3.2.3.

Figure 3-1. The model ACF of the AR(2) series of Example 3.2.4 with $\xi_1 = 2$ and $\xi_2 = 5$.

Figure 3-2. The model ACF of the AR(2) series of Example 3.2.4 with $\xi_1 = 10/9$ and $\xi_2 = 2$.
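
The third method translates almost line by line into code. The sketch below is one possible implementation using NumPy, not the ITSM routine; the names `psi_weights` and `arma_acvf` are hypothetical. It builds the first $p+1$ equations of (3.2.5) and (3.2.6) as a linear system for $\gamma(0), \ldots, \gamma(p)$ and then applies the remaining equations recursively.

```python
import numpy as np

def psi_weights(phi, theta, n):
    """First n coefficients psi_0..psi_{n-1} of the causal expansion."""
    th = [1.0] + list(theta)
    psi = []
    for j in range(n):
        val = th[j] if j < len(th) else 0.0
        val += sum(phi[i - 1] * psi[j - i] for i in range(1, min(len(phi), j) + 1))
        psi.append(val)
    return psi

def arma_acvf(phi, theta, sigma2, max_lag):
    """gamma(0..max_lag) for a causal ARMA(p,q) by the third method."""
    p, q = len(phi), len(theta)
    th = [1.0] + list(theta)
    psi = psi_weights(phi, theta, q + 1)

    def rhs(k):
        # Right-hand side of (3.2.5); it is zero once k exceeds q.
        if k > q:
            return 0.0
        return sigma2 * sum(th[k + j] * psi[j] for j in range(q - k + 1))

    # Equations k = 0..p in the unknowns gamma(0..p), using gamma(-h) = gamma(h).
    A = np.zeros((p + 1, p + 1))
    b = np.zeros(p + 1)
    for k in range(p + 1):
        A[k, k] += 1.0
        for i in range(1, p + 1):
            A[k, abs(k - i)] -= phi[i - 1]
        b[k] = rhs(k)
    gamma = list(np.linalg.solve(A, b))

    # Subsequent equations give gamma(p+1), gamma(p+2), ... successively.
    for k in range(p + 1, max_lag + 1):
        gamma.append(sum(phi[i - 1] * gamma[k - i] for i in range(1, p + 1)) + rhs(k))
    return np.array(gamma[: max_lag + 1])

# ARMA(1,1) with illustrative values phi = 0.5, theta = 0.4, sigma^2 = 1.
print(arma_acvf([0.5], [0.4], 1.0, 4))
```

For the ARMA(1,1) case the first two outputs are the solution of (3.2.8) and (3.2.9), and the later ones follow (3.2.10).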


3.2.2 The Autocorrelation Function 


Recall that the ACF of an ARMA process $\{X_t\}$ is the function $\rho(\cdot)$ found immediately
from the ACVF $\gamma(\cdot)$ as
\[
\rho(h) = \frac{\gamma(h)}{\gamma(0)}.
\]
Likewise, for any set of observations $\{x_1, \ldots, x_n\}$, the sample ACF $\hat\rho(\cdot)$ is computed
as
\[
\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}.
\]

The Sample ACF of an MA(q) Series. Given observations $\{x_1, \ldots, x_n\}$ of a time
series, one approach to the fitting of a model to the data is to match the sample ACF
of the data with the ACF of the model. In particular, if the sample ACF $\hat\rho(h)$ is sig-
nificantly different from zero for $0 \le h \le q$ and negligible for $h > q$, Example
3.2.2 suggests that an MA($q$) model might provide a good representation of the data.
In order to apply this criterion we need to take into account the random variation
expected in the sample autocorrelation function before we can classify ACF values
as "negligible." To resolve this problem we can use Bartlett's formula (Section 2.4),
which implies that for a large sample of size $n$ from an MA($q$) process, the sample
ACF values at lags greater than $q$ are approximately normally distributed with means
0 and variances $w_{hh}/n = \bigl(1 + 2\rho^2(1) + \cdots + 2\rho^2(q)\bigr)/n$. This means that if the sample
is from an MA($q$) process and if $h > q$, then $\hat\rho(h)$ should fall between the bounds
$\pm 1.96\sqrt{w_{hh}/n}$ with probability approximately 0.95. In practice we frequently use the
more stringent values $\pm 1.96/\sqrt{n}$ as the bounds between which sample autocorrela-
tions are considered "negligible." A more effective and systematic approach to the
problem of model selection, which also applies to ARMA($p, q$) models with $p > 0$
and $q > 0$, will be discussed in Section 5.5.
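
The check described above can be sketched in a few lines. This is an illustration only: the names `sample_acf` and `ma_bound` are hypothetical, the sample ACVF uses the divisor $n$, and the unknown $\rho(k)$ in $w_{hh}$ are replaced by their sample values, which is an assumption of the sketch rather than something stated in the text.

```python
import math

def sample_acf(x, max_lag):
    """Sample ACF rho_hat(0..max_lag), with the sample ACVF computed using divisor n."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    g0 = sum(v * v for v in d) / n
    return [sum(d[t] * d[t + h] for t in range(n - h)) / n / g0
            for h in range(max_lag + 1)]

def ma_bound(rho_hat, q, n, z=1.96):
    """Half-width z * sqrt(w_hh / n) of the Bartlett bounds for lags h > q."""
    w = 1.0 + 2.0 * sum(rho_hat[k] ** 2 for k in range(1, q + 1))
    return z * math.sqrt(w / n)

# Toy data, purely illustrative.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
rho = sample_acf(x, 2)
print(rho, ma_bound(rho, 1, len(x)))
```

Setting `q = 0` reproduces the simpler bounds $\pm 1.96/\sqrt{n}$.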


3.2.3 The Partial Autocorrelation Function 


The partial autocorrelation function (PACF) of an ARMA process $\{X_t\}$ is the
function $\alpha(\cdot)$ defined by the equations
\[
\alpha(0) = 1
\]
and
\[
\alpha(h) = \phi_{hh}, \qquad h \ge 1,
\]
where $\phi_{hh}$ is the last component of
\[
\boldsymbol{\phi}_h = \Gamma_h^{-1} \boldsymbol{\gamma}_h, \tag{3.2.14}
\]
$\Gamma_h = [\gamma(i - j)]_{i,j=1}^{h}$, and $\boldsymbol{\gamma}_h = [\gamma(1), \gamma(2), \ldots, \gamma(h)]'$.
For any set of observations $\{x_1, \ldots, x_n\}$ with $x_i \ne x_j$ for some $i$ and $j$, the sample
PACF $\hat\alpha(h)$ is given by
\[
\hat\alpha(0) = 1
\]
and
\[
\hat\alpha(h) = \hat\phi_{hh}, \qquad h \ge 1,
\]
where $\hat\phi_{hh}$ is the last component of
\[
\hat{\boldsymbol{\phi}}_h = \hat\Gamma_h^{-1} \hat{\boldsymbol{\gamma}}_h. \tag{3.2.15}
\]
We show in the next example that the PACF of a causal AR($p$) process is zero for
lags greater than $p$. Both sample and model partial autocorrelation functions can be
computed numerically using the program ITSM. Algebraic calculation of the PACF
is quite complicated except when $q$ is zero or $p$ and $q$ are both small.
It can be shown (TSTM, p. 171) that $\phi_{hh}$ is the correlation between the prediction
errors $X_h - P(X_h \mid X_1, \ldots, X_{h-1})$ and $X_0 - P(X_0 \mid X_1, \ldots, X_{h-1})$.
Example 3.2.6 The PACF of an AR(p) process
For the causal AR($p$) process defined by
\[
X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),
\]
we know (Problem 2.15) that for $h \ge p$ the best linear predictor of $X_{h+1}$ in terms of
$1, X_1, \ldots, X_h$ is
\[
\hat X_{h+1} = \phi_1 X_h + \phi_2 X_{h-1} + \cdots + \phi_p X_{h+1-p}.
\]
Since the coefficient $\phi_{hh}$ of $X_1$ is $\phi_p$ if $h = p$ and 0 if $h > p$, we conclude that the
PACF $\alpha(\cdot)$ of the process $\{X_t\}$ has the properties
\[
\alpha(p) = \phi_p
\]
and
\[
\alpha(h) = 0 \quad \text{for } h > p.
\]
For $h < p$ the values of $\alpha(h)$ can easily be computed from (3.2.14). For any
specified ARMA model the PACF can be evaluated numerically using the option
Model>ACF/PACF>Model of the program ITSM.
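
Definition (3.2.14) can be applied literally by solving one linear system per lag. The following NumPy sketch does this; the name `pacf_from_acvf` and the AR(2) coefficients are hypothetical, and the approach is chosen for clarity rather than speed. Because a common scale factor cancels in (3.2.14), autocorrelations may be passed in place of autocovariances.

```python
import numpy as np

def pacf_from_acvf(gamma, max_lag):
    """alpha(0..max_lag): last component of Gamma_h^{-1} gamma_h for each h."""
    gamma = np.asarray(gamma, dtype=float)
    alpha = [1.0]
    for h in range(1, max_lag + 1):
        G = np.array([[gamma[abs(i - j)] for j in range(h)] for i in range(h)])
        phi_h = np.linalg.solve(G, gamma[1:h + 1])
        alpha.append(phi_h[-1])
    return np.array(alpha)

# ACF of a causal AR(2) with illustrative coefficients phi = (0.5, 0.3).
phi1, phi2 = 0.5, 0.3
rho = [1.0, phi1 / (1.0 - phi2)]
for k in range(2, 7):
    rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])

print(np.round(pacf_from_acvf(rho, 5), 6))
```

The printed values illustrate the cut-off of Example 3.2.6: the lag-2 value equals $\phi_2$ and the later values are zero up to rounding.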
Example 3.2.7 The PACF of an MA(1) process
For the MA(1) process, it can be shown from (3.2.14) (see Problem 3.12) that the
PACF at lag $h$ is
\[
\alpha(h) = \phi_{hh} = -(-\theta)^h / \bigl(1 + \theta^2 + \cdots + \theta^{2h}\bigr).
\]

The Sample PACF of an AR(p) Series. If $\{X_t\}$ is an AR($p$) series, then the sample
PACF based on observations $\{x_1, \ldots, x_n\}$ should reflect (with sampling variation) the
properties of the PACF itself. In particular, if the sample PACF $\hat\alpha(h)$ is significantly
different from zero for $0 \le h \le p$ and negligible for $h > p$, Example 3.2.6 suggests
that an AR($p$) model might provide a good representation of the data. To decide what
is meant by "negligible" we can use the result that for an AR($p$) process the sample
PACF values at lags greater than $p$ are approximately independent $N(0, 1/n)$ random
variables. This means that roughly 95% of the sample PACF values beyond lag $p$
should fall within the bounds $\pm 1.96/\sqrt{n}$. If we observe a sample PACF satisfying
$|\hat\alpha(h)| > 1.96/\sqrt{n}$ for $0 \le h \le p$ and $|\hat\alpha(h)| < 1.96/\sqrt{n}$ for $h > p$, this suggests an
AR($p$) model for the data. For a more systematic approach to model selection, see
Section 5.5.


3.2.4 Examples


Example 3.2.8 The time series plotted in Figure 3.5 consists of 57 consecutive daily overshorts from
an underground gasoline tank at a filling station in Colorado. If $y_t$ is the measured
amount of fuel in the tank at the end of the $t$th day and $a_t$ is the measured amount
sold minus the amount delivered during the course of the $t$th day, then the overshort
at the end of day $t$ is defined as $x_t = y_t - y_{t-1} + a_t$. Due to the error in measuring
the current amount of fuel in the tank, the amount sold, and the amount delivered
to the station, we view $y_t$, $a_t$, and $x_t$ as observed values from some set of random
variables $Y_t$, $A_t$, and $X_t$ for $t = 1, \ldots, 57$. (In the absence of any measurement error
and any leak in the tank, each $x_t$ would be zero.) The data and their ACF are plotted
in Figures 3.5 and 3.6. To check the plausibility of an MA(1) model, the bounds
$\pm 1.96\bigl(1 + 2\hat\rho^2(1)\bigr)^{1/2}/n^{1/2}$ are also plotted in Figure 3.6. Since $\hat\rho(h)$ is well within
these bounds for $h > 1$, the data appear to be compatible with the model
\[
X_t = \mu + Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{3.2.16}
\]
The mean $\mu$ may be estimated by the sample mean $\bar x_{57} = -4.035$, and the parameters
$\theta$, $\sigma^2$ may be estimated by equating the sample ACVF with the model ACVF at lags
0 and 1, and solving the resulting equations for $\theta$ and $\sigma^2$. This estimation procedure
is known as the method of moments, and in this case gives the equations
\[
(1 + \theta^2)\sigma^2 = \hat\gamma(0) = 3415.72,
\]
\[
\theta\sigma^2 = \hat\gamma(1) = -1719.95.
\]
Using the approximate solution $\theta = -1$ and $\sigma^2 = 1708$, we obtain the noninvertible
MA(1) model
\[
X_t = -4.035 + Z_t - Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, 1708).
\]
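
The two moment equations have an exact real solution only when $|\hat\gamma(1)/\hat\gamma(0)| \le 1/2$. The sketch below, with the hypothetical name `ma1_moments`, returns the invertible root in that case. When the ratio exceeds 1/2 in absolute value, as it does for the overshorts, it falls back to the boundary value $\theta = \pm 1$ with $\sigma^2 = \hat\gamma(0)/2$; that fallback is an assumption of this sketch, chosen to mirror the approximate solution quoted above.

```python
import math

def ma1_moments(g0, g1):
    """Method-of-moments (theta, sigma2) for an MA(1) from gamma_hat(0), gamma_hat(1)."""
    rho = g1 / g0
    if abs(rho) > 0.5:
        # No real solution exists: use the boundary (noninvertible) value instead.
        return math.copysign(1.0, rho), g0 / 2.0
    if rho == 0.0:
        return 0.0, g0
    theta = (1.0 - math.sqrt(1.0 - 4.0 * rho * rho)) / (2.0 * rho)  # root with |theta| <= 1
    return theta, g0 / (1.0 + theta * theta)

# Overshorts values quoted in the text.
print(ma1_moments(3415.72, -1719.95))
```

For these data the sketch gives $\theta = -1$ and $\sigma^2 \approx 1707.9$, which rounds to the value 1708 used in the fitted model.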
Typically, in time series modeling we have little or no knowledge of the underlying
physical mechanism generating the data, and the choice of a suitable class of models
is entirely data driven. For the time series of overshorts, the data, through the graph
of the ACF, lead us to the MA(1) model. Alternatively, we can attempt to model the
mechanism generating the time series of overshorts using a structural model. As we
will see, the structural model formulation leads us again to the MA(1) model. In the
structural model setup, write $Y_t$, the observed amount of fuel in the tank at time $t$, as
\[
Y_t = y_t^* + U_t, \tag{3.2.17}
\]
where $y_t^*$ is the true (or actual) amount of fuel in the tank at time $t$ (not to be confused
with $y_t$ above) and $U_t$ is the resulting measurement error. The variable $y_t^*$ is an ide-
alized quantity that in principle cannot be observed even with the most sophisticated
measurement devices. Similarly, we assume that
\[
A_t = a_t^* + V_t, \tag{3.2.18}
\]
where $a_t^*$ is the actual amount of fuel sold minus the actual amount delivered during
day $t$, and $V_t$ is the associated measurement error. We further assume that $\{U_t\} \sim
\mathrm{WN}(0, \sigma_U^2)$, $\{V_t\} \sim \mathrm{WN}(0, \sigma_V^2)$, and that the two sequences $\{U_t\}$ and $\{V_t\}$ are uncor-
related with one another ($E(U_t V_s) = 0$ for all $s$ and $t$). If the change of level per day
due to leakage is $\mu$ gallons ($\mu < 0$ indicates leakage), then
\[
y_t^* = \mu + y_{t-1}^* - a_t^*. \tag{3.2.19}
\]
This equation relates the actual amounts of fuel in the tank at the end of days $t$ and
$t - 1$, adjusted for the actual amounts that have been sold and delivered during the
day. Using (3.2.17)-(3.2.19), the model for the time series of overshorts is given by
\[
X_t = Y_t - Y_{t-1} + A_t = \mu + U_t - U_{t-1} + V_t.
\]
This model is stationary and 1-correlated, since
\[
E X_t = E(\mu + U_t - U_{t-1} + V_t) = \mu
\]
and
\[
\begin{aligned}
\gamma(h) &= E[(X_{t+h} - \mu)(X_t - \mu)] \\
&= E[(U_{t+h} - U_{t+h-1} + V_{t+h})(U_t - U_{t-1} + V_t)] \\
&= \begin{cases}
2\sigma_U^2 + \sigma_V^2, & \text{if } h = 0, \\
-\sigma_U^2, & \text{if } |h| = 1, \\
0, & \text{otherwise.}
\end{cases}
\end{aligned}
\]
It follows from Proposition 2.1.1 that $\{X_t\}$ is the MA(1) model (3.2.16) with
\[
\rho(1) = \frac{\theta_1}{1 + \theta_1^2} = \frac{-\sigma_U^2}{2\sigma_U^2 + \sigma_V^2}.
\]
From this equation we see that the measurement error associated with the adjustment
$\{A_t\}$ is zero (i.e., $\sigma_V^2 = 0$) if and only if $\rho(1) = -.5$ or, equivalently, if and only
if $\theta_1 = -1$. From the analysis above, the moment estimator of $\theta_1$ for the overshort
data is in fact $-1$, so that we conclude that there is relatively little measurement error
associated with the amount of fuel sold and delivered.

We shall return to a more general discussion of structural models in Chap-
ter 8.

Figure 3-7. The sample PACF of the sunspot numbers with the bounds $\pm 1.96/\sqrt{100}$.

Example 3.2.9 The sunspot numbers


Figure 3.7 shows the sample PACF of the sunspot numbers $S_1, \ldots, S_{100}$ (for the years
1770-1869) as obtained from ITSM by opening the project SUNSPOTS.TSM and
clicking on the second yellow button at the top of the screen. The graph also shows the
bounds $\pm 1.96/\sqrt{100}$. The fact that all of the PACF values beyond lag 2 fall within the
bounds suggests the possible suitability of an AR(2) model for the mean-corrected
data set $X_t = S_t - 46.93$. One simple way to estimate the parameters $\phi_1$, $\phi_2$, and $\sigma^2$
of such a model is to require that the ACVF of the model at lags 0, 1, and 2 should
match the sample ACVF at those lags. Substituting the sample ACVF values
\[
\hat\gamma(0) = 1382.2, \qquad \hat\gamma(1) = 1114.4, \qquad \hat\gamma(2) = 591.73,
\]
for $\gamma(0)$, $\gamma(1)$, and $\gamma(2)$ in the first three equations of (3.2.5) and (3.2.6) and solving
for $\phi_1$, $\phi_2$, and $\sigma^2$ gives the fitted model
\[
X_t - 1.318 X_{t-1} + 0.634 X_{t-2} = Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, 289.2). \tag{3.2.20}
\]
(This method of model fitting is called Yule-Walker estimation and will be discussed
more fully in Section 5.1.1.)
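
The fitted coefficients in (3.2.20) can be reproduced from the three sample autocovariances alone. The sketch below uses NumPy and is only an illustration of the calculation: the equations for $k = 1, 2$ are solved for $\phi_1, \phi_2$, and the equation for $k = 0$ then gives $\sigma^2$.

```python
import numpy as np

# Sample ACVF of the mean-corrected sunspot numbers, as quoted in the text.
g = np.array([1382.2, 1114.4, 591.73])

# k = 1, 2:  gamma(k) - phi_1 gamma(k-1) - phi_2 gamma(k-2) = 0
G = np.array([[g[0], g[1]],
              [g[1], g[0]]])
phi = np.linalg.solve(G, g[1:])

# k = 0:  gamma(0) - phi_1 gamma(1) - phi_2 gamma(2) = sigma^2
sigma2 = g[0] - phi @ g[1:]

print(np.round(phi, 3), round(float(sigma2), 1))
```

Up to rounding, the output agrees with the coefficients 1.318 and $-0.634$ and the white noise variance 289.2.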
3.3 Forecasting ARMA Processes


The innovations algorithm (see Section 2.5.2) provided us with a recursive method
for forecasting second-order zero-mean processes that are not necessarily stationary.
For the causal ARMA process
\[
\phi(B) X_t = \theta(B) Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),
\]
it is possible to simplify the application of the algorithm drastically. The idea is to
apply it not to the process $\{X_t\}$ itself, but to the transformed process (cf. Ansley, 1979)
\[
\begin{cases}
W_t = \sigma^{-1} X_t, & t = 1, \ldots, m, \\
W_t = \sigma^{-1} \phi(B) X_t, & t > m,
\end{cases} \tag{3.3.1}
\]
where
\[
m = \max(p, q). \tag{3.3.2}
\]
For notational convenience we define $\theta_0 := 1$ and $\theta_j := 0$ for $j > q$. We shall also
assume that $p \ge 1$ and $q \ge 1$. (There is no loss of generality in these assumptions,
since in the analysis that follows we may take any of the coefficients $\phi_i$ and $\theta_i$ to be
zero.)

The autocovariance function $\gamma_X(\cdot)$ of $\{X_t\}$ can easily be computed using any of
the methods described in Section 3.2.1. The autocovariances $\kappa(i, j) = E(W_i W_j)$,
$i, j \ge 1$, are then found from
\[
\kappa(i, j) =
\begin{cases}
\sigma^{-2} \gamma_X(i - j), & 1 \le i, j \le m, \\
\sigma^{-2} \Bigl[ \gamma_X(i - j) - \sum_{r=1}^{p} \phi_r \gamma_X(r - |i - j|) \Bigr], & \min(i, j) \le m < \max(i, j) \le 2m, \\
\sum_{r=0}^{q} \theta_r \theta_{r + |i - j|}, & \min(i, j) > m, \\
0, & \text{otherwise.}
\end{cases} \tag{3.3.3}
\]
Applying the innovations algorithm to the process $\{W_t\}$ we obtain
\[
\begin{cases}
\hat W_{n+1} = \sum_{j=1}^{n} \theta_{nj} \bigl( W_{n+1-j} - \hat W_{n+1-j} \bigr), & 1 \le n < m, \\
\hat W_{n+1} = \sum_{j=1}^{q} \theta_{nj} \bigl( W_{n+1-j} - \hat W_{n+1-j} \bigr), & n \ge m,
\end{cases} \tag{3.3.4}
\]
where the coefficients $\theta_{nj}$ and the mean squared errors $r_n = E\bigl(W_{n+1} - \hat W_{n+1}\bigr)^2$ are
found recursively from the innovations algorithm with $\kappa$ defined as in (3.3.3). The
notable feature of the predictors (3.3.4) is the vanishing of $\theta_{nj}$ when both $n \ge m$ and
$j > q$. This is a consequence of the innovations algorithm and the fact that $\kappa(r, s) = 0$
if $r > m$ and $|r - s| > q$.

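Observe now that equations (3.3.1) allow each $X_n$, $n \ge 1$, to be written as a linear
combination of $W_j$, $1 \le j \le n$, and, conversely, each $W_n$, $n \ge 1$, to be written as a
linear combination of $X_j$, $1 \le j \le n$. This means that the best linear predictor of any
random variable $Y$ in terms of $\{1, X_1, \ldots, X_n\}$ is the same as the best linear predictor
of $Y$ in terms of $\{1, W_1, \ldots, W_n\}$. We shall denote this predictor by $P_n Y$. In particular,
the one-step predictors of $W_{n+1}$ and $X_{n+1}$ are given by
\[
\hat W_{n+1} = P_n W_{n+1}
\]
and
\[
\hat X_{n+1} = P_n X_{n+1}.
\]
Using the linearity of $P_n$ and equations (3.3.1) we see that
\[
\begin{cases}
\hat W_t = \sigma^{-1} \hat X_t, & t = 1, \ldots, m, \\
\hat W_t = \sigma^{-1} \bigl[ \hat X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} \bigr], & t > m,
\end{cases} \tag{3.3.5}
\]
which, together with (3.3.1), shows that
\[
X_t - \hat X_t = \sigma \bigl[ W_t - \hat W_t \bigr] \quad \text{for all } t \ge 1. \tag{3.3.6}
\]
Replacing $(W_j - \hat W_j)$ by $\sigma^{-1}(X_j - \hat X_j)$ in (3.3.4) and then substituting into (3.3.5),
we finally obtain
\[
\hat X_{n+1} =
\begin{cases}
\sum_{j=1}^{n} \theta_{nj} \bigl( X_{n+1-j} - \hat X_{n+1-j} \bigr), & 1 \le n < m, \\
\phi_1 X_n + \cdots + \phi_p X_{n+1-p} + \sum_{j=1}^{q} \theta_{nj} \bigl( X_{n+1-j} - \hat X_{n+1-j} \bigr), & n \ge m,
\end{cases} \tag{3.3.7}
\]
and
\[
E\bigl(X_{n+1} - \hat X_{n+1}\bigr)^2 = \sigma^2 E\bigl(W_{n+1} - \hat W_{n+1}\bigr)^2 = \sigma^2 r_n, \tag{3.3.8}
\]
where $\theta_{nj}$ and $r_n$ are found from the innovations algorithm with $\kappa$ as in (3.3.3).
Equations (3.3.7) determine the one-step predictors $\hat X_2, \hat X_3, \ldots$ recursively.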


Remark 1. It can be shown (see TSTM, Problem 5.6) that if $\{X_t\}$ is invertible, then
as $n \to \infty$,
\[
E\bigl(X_n - \hat X_n - Z_n\bigr)^2 \to 0,
\]
\[
\theta_{nj} \to \theta_j, \qquad j = 1, \ldots, q,
\]
and
\[
r_n \to 1.
\]
Algebraic calculation of the coefficients $\theta_{nj}$ and $r_n$ is not feasible except for very sim-
ple models, such as those considered in the following examples. However, numerical
implementation of the recursions is quite straightforward and is used to compute
predictors in the program ITSM.


Example 3.3.1 Prediction of an AR(p) process
Applying (3.3.7) to the ARMA($p$, 1) process with $\theta_1 = 0$, we easily find that
\[
\hat X_{n+1} = \phi_1 X_n + \cdots + \phi_p X_{n+1-p}, \qquad n \ge p.
\]


Example 3.3.2 Prediction of an MA(q) process
Applying (3.3.7) to the ARMA(1, $q$) process with $\phi_1 = 0$ gives
\[
\hat X_{n+1} = \sum_{j=1}^{\min(n,q)} \theta_{nj} \bigl( X_{n+1-j} - \hat X_{n+1-j} \bigr), \qquad n \ge 1,
\]
where the coefficients $\theta_{nj}$ are found by applying the innovations algorithm to the co-
variances $\kappa(i, j)$ defined in (3.3.3). Since in this case the processes $\{W_t\}$ and $\{\sigma^{-1} X_t\}$
are identical, these covariances are simply
\[
\kappa(i, j) = \sigma^{-2} \gamma_X(i - j) = \sum_{r=0}^{q - |i-j|} \theta_r \theta_{r + |i-j|}.
\]


Example 3.3.3 Prediction of an ARMA(1,1) process
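If
\[
X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),
\]
and $|\phi| < 1$, then equations (3.3.7) reduce to the single equation
\[
\hat X_{n+1} = \phi X_n + \theta_{n1} (X_n - \hat X_n), \qquad n \ge 1.
\]
To compute $\theta_{n1}$, we first use Example 3.2.1 to find that $\gamma_X(0) = \sigma^2 (1 + 2\theta\phi + \theta^2)/(1 - \phi^2)$.
Substituting in (3.3.3) then gives, for $i, j \ge 1$,
\[
\kappa(i, j) =
\begin{cases}
(1 + 2\theta\phi + \theta^2)/(1 - \phi^2), & i = j = 1, \\
1 + \theta^2, & i = j \ge 2, \\
\theta, & |i - j| = 1, \ i \ge 1, \\
0, & \text{otherwise.}
\end{cases}
\]
With these values of $\kappa(i, j)$, the recursions of the innovations algorithm reduce to
\[
\begin{aligned}
r_0 &= (1 + 2\theta\phi + \theta^2)/(1 - \phi^2), \\
\theta_{n1} &= \theta / r_{n-1}, \\
r_n &= 1 + \theta^2 - \theta^2 / r_{n-1},
\end{aligned} \tag{3.3.9}
\]
which can be solved quite explicitly (see Problem 3.13).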
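
The recursions (3.3.9) are short enough to iterate directly. In the sketch below the function name `arma11_recursion` and the parameter values $\phi = 0.5$, $\theta = 0.4$ are hypothetical, chosen so that the process is causal and invertible.

```python
def arma11_recursion(phi, theta, N):
    """Iterate (3.3.9); returns (th, r) with th[n] = theta_{n1} and r[n] = r_n."""
    r = [(1 + 2 * theta * phi + theta ** 2) / (1 - phi ** 2)]   # r_0
    th = [None]                                                 # theta_{01} is not defined
    for n in range(1, N + 1):
        th.append(theta / r[n - 1])
        r.append(1 + theta ** 2 - theta ** 2 / r[n - 1])
    return th, r

th, r = arma11_recursion(0.5, 0.4, 30)
print(r[0], th[1], r[1], th[30], r[30])
```

In this invertible case $r_n$ tends to 1 and $\theta_{n1}$ tends to $\theta$ quite quickly, in line with Remark 1.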
Example 3.3.4 Numerical prediction of an ARMA(2,3) process
In this example we illustrate the steps involved in numerical prediction of an
ARMA(2,3) process. Of course, these steps are shown for illustration only. The calcu-
lations are all carried out automatically by ITSM in the course of computing predictors
for any specified data set and ARMA model. The process we shall consider is the
ARMA process defined by the equations
\[
X_t - X_{t-1} + 0.24 X_{t-2} = Z_t + 0.4 Z_{t-1} + 0.2 Z_{t-2} + 0.1 Z_{t-3}, \tag{3.3.10}
\]
where $\{Z_t\} \sim \mathrm{WN}(0, 1)$. Ten values $X_1, \ldots, X_{10}$ simulated by the program ITSM
are shown in Table 3.1. (These were produced using the option Model>Specify to
specify the order and parameters of the model and then Model>Simulate to generate
the series from the specified model.)

The first step is to compute the covariances $\gamma_X(h)$, $h = 0, 1, 2$, which are easily
found from equations (3.2.5) with $k = 0, 1, 2$ to be
\[
\gamma_X(0) = 7.17133, \qquad \gamma_X(1) = 6.44139, \qquad \text{and} \qquad \gamma_X(2) = 5.0603.
\]
From (3.3.3) we find that the symmetric matrix $K = [\kappa(i, j)]_{i,j=1,2,\ldots}$ is given by
(only the lower triangle is shown)
\[
K = \begin{bmatrix}
7.1713 & & & & & & & \\
6.4414 & 7.1713 & & & & & & \\
5.0603 & 6.4414 & 7.1713 & & & & & \\
0.10 & 0.34 & 0.816 & 1.21 & & & & \\
0 & 0.10 & 0.34 & 0.50 & 1.21 & & & \\
0 & 0 & 0.10 & 0.24 & 0.50 & 1.21 & & \\
0 & 0 & 0 & 0.10 & 0.24 & 0.50 & 1.21 & \\
\vdots & & & \ddots & \ddots & \ddots & \ddots & \ddots
\end{bmatrix}.
\]
The next step is to solve the recursions of the innovations algorithm for $\theta_{nj}$ and
$r_n$ using these values for $\kappa(i, j)$. Then
\[
\hat X_{n+1} =
\begin{cases}
\sum_{j=1}^{n} \theta_{nj} \bigl( X_{n+1-j} - \hat X_{n+1-j} \bigr), & n = 1, 2, \\
X_n - 0.24 X_{n-1} + \sum_{j=1}^{3} \theta_{nj} \bigl( X_{n+1-j} - \hat X_{n+1-j} \bigr), & n = 3, 4, \ldots,
\end{cases}
\]
and
\[
E\bigl(X_{n+1} - \hat X_{n+1}\bigr)^2 = \sigma^2 r_n = r_n.
\]
The results are shown in Table 3.1.
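
The steps of this example can be retraced with a short script. The sketch below is not the ITSM code: it hard-codes the model (3.3.10) and the ten observations listed in Table 3.1, extends $\gamma_X$ beyond lag 2 with the difference equations (3.2.5) and (3.2.6), builds $\kappa$ from (3.3.3), and runs the innovations algorithm of Section 2.5.2. The names `kappa`, `innovations`, `Theta`, and `xhat` are hypothetical, and because the observations are rounded to three decimals the predictors agree with the table only to about two or three decimals.

```python
import numpy as np

phi, theta, sigma2 = [1.0, -0.24], [0.4, 0.2, 0.1], 1.0
p, q = len(phi), len(theta)
m = max(p, q)
th0 = [1.0] + theta                       # theta_0 := 1

# gamma_X(0..2) as given in the text, then extended to lag 2m - 1.
gam = [7.17133, 6.44139, 5.0603]
gam.append(gam[2] - 0.24 * gam[1] + 0.1)  # k = 3 in (3.2.5): RHS is sigma^2 * theta_3
for k in range(4, 2 * m):
    gam.append(gam[k - 1] - 0.24 * gam[k - 2])   # (3.2.6)

def gamma_x(h):
    return gam[abs(h)]

def kappa(i, j):
    """kappa(i, j) = E(W_i W_j) from (3.3.3)."""
    lo, hi, d = min(i, j), max(i, j), abs(i - j)
    if hi <= m:
        return gamma_x(d) / sigma2
    if lo <= m < hi <= 2 * m:
        return (gamma_x(d) - sum(phi[r - 1] * gamma_x(r - d) for r in range(1, p + 1))) / sigma2
    if lo > m:
        return sum(th0[r] * th0[r + d] for r in range(q - d + 1)) if d <= q else 0.0
    return 0.0

def innovations(kap, N):
    """Innovations algorithm: Theta[n, j] = theta_{nj}, r[n] = r_n."""
    r = np.zeros(N + 1)
    Theta = np.zeros((N + 1, N + 1))
    r[0] = kap(1, 1)
    for n in range(1, N + 1):
        for k in range(n):
            s = sum(Theta[k, k - j] * Theta[n, n - j] * r[j] for j in range(k))
            Theta[n, n - k] = (kap(n + 1, k + 1) - s) / r[k]
        r[n] = kap(n + 1, n + 1) - sum(Theta[n, n - j] ** 2 * r[j] for j in range(n))
    return Theta, r

x = [None, 1.704, 0.527, 1.041, 0.942, 0.555, -1.002, -0.585, 0.010, -0.638, 0.525]
N = 10
Theta, r = innovations(kappa, N)

xhat = [None, 0.0]                        # xhat[1] = 0
for n in range(1, N + 1):
    terms = min(n, q) if n >= m else n
    ma = sum(Theta[n, j] * (x[n + 1 - j] - xhat[n + 1 - j]) for j in range(1, terms + 1))
    ar = sum(phi[i - 1] * x[n + 1 - i] for i in range(1, p + 1)) if n >= m else 0.0
    xhat.append(ar + ma)                  # one-step predictor of X_{n+1}, see (3.3.7)

print(np.round(r[:5], 4), np.round(xhat[1:], 4))
```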


h-Step Prediction of an ARMA(p, q) Process
As in Section 2.5, we use $P_n Y$ to denote the best linear predictor of $Y$ in terms of
$X_1, \ldots, X_n$ (which, as pointed out after (3.3.4), is the same as the best linear predictor
of $Y$ in terms of $W_1, \ldots, W_n$). Then from (2.5.30) we have
\[
P_n W_{n+h} = \sum_{j=h}^{n+h-1} \theta_{n+h-1,j} \bigl( W_{n+h-j} - \hat W_{n+h-j} \bigr)
= \sigma^{-1} \sum_{j=h}^{n+h-1} \theta_{n+h-1,j} \bigl( X_{n+h-j} - \hat X_{n+h-j} \bigr).
\]
Using this result and applying the operator $P_n$ to each side of equations (3.3.1), we
conclude that the $h$-step predictors $P_n X_{n+h}$ satisfy
\[
P_n X_{n+h} =
\begin{cases}
\sum_{j=h}^{n+h-1} \theta_{n+h-1,j} \bigl( X_{n+h-j} - \hat X_{n+h-j} \bigr), & 1 \le h \le m - n, \\
\sum_{i=1}^{p} \phi_i P_n X_{n+h-i} + \sum_{j=h}^{n+h-1} \theta_{n+h-1,j} \bigl( X_{n+h-j} - \hat X_{n+h-j} \bigr), & h > m - n.
\end{cases} \tag{3.3.11}
\]
If, as is almost always the case, $n > m = \max(p, q)$, then for all $h \ge 1$,
\[
P_n X_{n+h} = \sum_{i=1}^{p} \phi_i P_n X_{n+h-i} + \sum_{j=h}^{q} \theta_{n+h-1,j} \bigl( X_{n+h-j} - \hat X_{n+h-j} \bigr). \tag{3.3.12}
\]


Once the predictors $\hat X_1, \ldots, \hat X_n$ have been computed from (3.3.7), it is a straightforward
calculation, with $n$ fixed, to determine the predictors $P_n X_{n+1}, P_n X_{n+2}, P_n X_{n+3}, \ldots$
recursively from (3.3.12) (or (3.3.11) if $n \le m$). The calculations are performed
automatically in the Forecasting>ARMA option of the program ITSM.

Table 3.1   $\hat X_{n+1}$ for the ARMA(2,3) process of Example 3.3.4.

     n    X_{n+1}     r_n     θ_{n1}    θ_{n2}    θ_{n3}    X̂_{n+1}
     0     1.704    7.1713                                   0
     1     0.527    1.3856    0.8982                         1.5305
     2     1.041    1.0057    1.3685    0.7056              -0.1710
     3     0.942    1.0019    0.4008    0.1806    0.0139     1.2428
     4     0.555    1.0019    0.3998    0.2020    0.0732     0.7443
     5    -1.002    1.0005    0.3992    0.1995    0.0994     0.3138
     6    -0.585    1.0000    0.4000    0.1997    0.0998    -1.7293
     7     0.010    1.0000    0.4000    0.2000    0.0998    -0.1688
     8    -0.638    1.0000    0.4000    0.2000    0.0999     0.3193
     9     0.525    1.0000    0.4000    0.2000    0.1000    -0.8731
    10              1.0000    0.4000    0.2000    0.1000     1.0638
    11              1.0000    0.4000    0.2000    0.1000
    12              1.0000    0.4000    0.2000    0.1000


Example 3.3.5 h-step prediction of an ARMA(2,3) process
To compute $h$-step predictors, $h = 1, \ldots, 10$, for the data of Example 3.3.4 and
the model (3.3.10), open the project E334.TSM in ITSM and enter the model using
the option Model>Specify. Then select Forecasting>ARMA and specify 10 for the
number of forecasts required. You will notice that the white noise variance is au-
tomatically set by ITSM to an estimate based on the sample. To retain the model
value of 1, you must reset the white noise variance to this value. Then click OK and
you will see a graph of the original series with the ten predicted values appended.
If you right-click on the graph and select Info, you will see the numerical results
shown in Table 3.2 as well as prediction bounds based on the assumption
that the series is Gaussian. (Prediction bounds are discussed in the last paragraph of
this chapter.) The mean squared errors are calculated as described below. Notice how
the predictors converge fairly rapidly to the mean of the process (i.e., zero) as the lead
time $h$ increases. Correspondingly, the mean squared error of the $h$-step predictor
increases from the white noise variance (i.e., 1) at $h = 1$ to the variance of $X_t$
(i.e., 7.1713), which is virtually reached at $h = 10$.


The Mean Squared Error of $P_n X_{n+h}$
The mean squared error of $P_n X_{n+h}$ is easily computed by ITSM from the formula
\[
\sigma_n^2(h) := E\bigl(X_{n+h} - P_n X_{n+h}\bigr)^2
= \sum_{j=0}^{h-1} \Bigl( \sum_{r=0}^{j} \chi_r \theta_{n+h-r-1,\, j-r} \Bigr)^2 v_{n+h-j-1}, \tag{3.3.13}
\]
where the coefficients $\chi_j$ are computed recursively from the equations $\chi_0 = 1$ and
\[
\chi_j = \sum_{k=1}^{\min(p, j)} \phi_k \chi_{j-k}, \qquad j = 1, 2, \ldots. \tag{3.3.14}
\]

Table 3.2   $h$-step predictors for the ARMA(2,3) series of Example 3.3.4.

     h    P_10 X_{10+h}    √MSE
     1       1.0638       1.0000
     2       1.1217       1.7205
     3       1.0062       2.1931
     4       0.7370       2.4643
     5       0.4955       2.5902
     6       0.3186       2.6434
     7       0.1997       2.6648
     8       0.1232       2.6730
     9       0.0753       2.6761
    10       0.0457       2.6773
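Example 3.3.6 h-step prediction of an ARMA(2,3) process
We now illustrate the use of (3.3.12) and (3.3.13) for the $h$-step predictors and their
mean squared errors by manually reproducing the output of ITSM shown in Table
3.2. From (3.3.12) and Table 3.1 we obtain
\[
\begin{aligned}
P_{10} X_{12} &= \sum_{i=1}^{2} \phi_i P_{10} X_{12-i} + \sum_{j=2}^{3} \theta_{11,j} \bigl( X_{12-j} - \hat X_{12-j} \bigr) \\
&= \phi_1 \hat X_{11} + \phi_2 X_{10} + 0.2 \bigl( X_{10} - \hat X_{10} \bigr) + 0.1 \bigl( X_9 - \hat X_9 \bigr) \\
&= 1.1217
\end{aligned}
\]
and
\[
\begin{aligned}
P_{10} X_{13} &= \sum_{i=1}^{2} \phi_i P_{10} X_{13-i} + \sum_{j=3}^{3} \theta_{12,j} \bigl( X_{13-j} - \hat X_{13-j} \bigr) \\
&= \phi_1 P_{10} X_{12} + \phi_2 \hat X_{11} + 0.1 \bigl( X_{10} - \hat X_{10} \bigr) \\
&= 1.0062.
\end{aligned}
\]
For $k > 13$, $P_{10} X_k$ is easily found recursively from
\[
P_{10} X_k = \phi_1 P_{10} X_{k-1} + \phi_2 P_{10} X_{k-2}.
\]
To find the mean squared errors we use (3.3.13) with $\chi_0 = 1$, $\chi_1 = \phi_1 = 1$, and
$\chi_2 = \phi_1 \chi_1 + \phi_2 = 0.76$. Using the values of $\theta_{nj}$ and $v_j (= r_j)$ in Table 3.1, we obtain
\[
\sigma_{10}^2(2) = E(X_{12} - P_{10} X_{12})^2 = 2.960
\]
and
\[
\sigma_{10}^2(3) = E(X_{13} - P_{10} X_{13})^2 = 4.810,
\]
in accordance with the results shown in Table 3.2.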
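
The hand calculation above can be mirrored by a few lines of code that apply the recursion (3.3.12) with $n = 10$. The sketch below takes the needed observations and one-step predictors directly from Table 3.1 and uses the limiting coefficients $\theta_{nj} = (0.4, 0.2, 0.1)$ listed there for $n \ge 10$; the function name `forecasts` and the dictionary layout are hypothetical.

```python
phi = [1.0, -0.24]
theta_n = [0.4, 0.2, 0.1]        # theta_{n1}, theta_{n2}, theta_{n3} for n >= 10 (Table 3.1)
n, q = 10, 3

# Observations and one-step predictors needed for the innovation terms (Table 3.1).
x = {8: 0.010, 9: -0.638, 10: 0.525}
xhat = {8: -0.1688, 9: 0.3193, 10: -0.8731}

def forecasts(H):
    """Return {t: P_n X_t} for t = n+1, ..., n+H using (3.3.12)."""
    pred = {}

    def P(t):                     # P_n X_t equals X_t itself when t <= n
        return x[t] if t <= n else pred[t]

    for h in range(1, H + 1):
        ar = sum(phi[i - 1] * P(n + h - i) for i in range(1, len(phi) + 1))
        ma = sum(theta_n[j - 1] * (x[n + h - j] - xhat[n + h - j]) for j in range(h, q + 1))
        pred[n + h] = ar + ma
    return pred

pred = forecasts(10)
print({t: round(v, 4) for t, v in pred.items()})
```

For lead times beyond $q = 3$ the innovation sum is empty, so the predictors follow the homogeneous recursion stated in the example.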


Large-Sample Approximations
Assuming as usual that the ARMA($p, q$) process defined by $\phi(B) X_t = \theta(B) Z_t$,
$\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, is causal and invertible, we have the representations
\[
X_{n+h} = \sum_{j=0}^{\infty} \psi_j Z_{n+h-j} \tag{3.3.15}
\]
and
\[
Z_{n+h} = X_{n+h} + \sum_{j=1}^{\infty} \pi_j X_{n+h-j}, \tag{3.3.16}
\]
where $\{\psi_j\}$ and $\{\pi_j\}$ are uniquely determined by equations (3.1.7) and (3.1.8), respec-
tively. Let $\tilde P_n Y$ denote the best (i.e., minimum mean squared error) approximation to
$Y$ that is a linear combination or limit of linear combinations of $X_t$, $-\infty < t \le n$,
or equivalently (by (3.3.15) and (3.3.16)) of $Z_t$, $-\infty < t \le n$. The properties of the
operator $\tilde P_n$ were discussed in Section 2.5.3. Applying $\tilde P_n$ to each side of equations
(3.3.15) and (3.3.16) gives
\[
\tilde P_n X_{n+h} = \sum_{j=h}^{\infty} \psi_j Z_{n+h-j} \tag{3.3.17}
\]
and
\[
\tilde P_n X_{n+h} = -\sum_{j=1}^{\infty} \pi_j \tilde P_n X_{n+h-j}. \tag{3.3.18}
\]
For $h = 1$ each $\tilde P_n X_{n+1-j}$ on the right of (3.3.18) is just $X_{n+1-j}$. Once $\tilde P_n X_{n+1}$ has
been evaluated, $\tilde P_n X_{n+2}$ can then be computed from (3.3.18). The predictors $\tilde P_n X_{n+3}$,
$\tilde P_n X_{n+4}, \ldots$ can then be computed successively in the same way. Subtracting (3.3.17)
from (3.3.15) gives the $h$-step prediction error as
\[
X_{n+h} - \tilde P_n X_{n+h} = \sum_{j=0}^{h-1} \psi_j Z_{n+h-j},
\]
from which we see that the mean squared error is
\[
\tilde\sigma^2(h) = \sigma^2 \sum_{j=0}^{h-1} \psi_j^2. \tag{3.3.19}
\]
The predictors obtained in this way have the form
\[
\tilde P_n X_{n+h} = \sum_{j=0}^{\infty} c_j X_{n-j}. \tag{3.3.20}
\]
In practice, of course, we have only observations $X_1, \ldots, X_n$ available, so we must
truncate the series (3.3.20) after $n$ terms. The resulting predictor is a useful approx-
imation to $P_n X_{n+h}$ if $n$ is large and the coefficients $c_j$ converge to zero rapidly as $j$
increases. It can be shown that the mean squared error (3.3.19) of $\tilde P_n X_{n+h}$ can also
be obtained by letting $n \to \infty$ in the expression (3.3.13) for the mean squared error
of $P_n X_{n+h}$, so that $\tilde\sigma^2(h)$ is an easily calculated approximation to $\sigma_n^2(h)$ for large $n$.
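
Formula (3.3.19) needs only the coefficients $\psi_j$ of the causal representation. The sketch below computes them with the usual recursion $\psi_j = \theta_j + \sum_{k=1}^{\min(p,j)} \phi_k \psi_{j-k}$ and accumulates the squared weights; the function names are hypothetical, and the model used in the usage lines is (3.3.10).

```python
import math

def psi_weights(phi, theta, n):
    """psi_0..psi_{n-1} for a causal ARMA(p,q), with theta_0 := 1 and theta_j := 0 for j > q."""
    th = [1.0] + list(theta)
    psi = []
    for j in range(n):
        val = th[j] if j < len(th) else 0.0
        val += sum(phi[k - 1] * psi[j - k] for k in range(1, min(len(phi), j) + 1))
        psi.append(val)
    return psi

def large_sample_mse(phi, theta, sigma2, H):
    """Return [sigma~^2(1), ..., sigma~^2(H)] from (3.3.19)."""
    psi = psi_weights(phi, theta, H)
    out, s = [], 0.0
    for j in range(H):
        s += psi[j] ** 2
        out.append(sigma2 * s)
    return out

mse = large_sample_mse([1.0, -0.24], [0.4, 0.2, 0.1], 1.0, 10)
print([round(math.sqrt(v), 4) for v in mse])
```

For this model the square roots of the approximate mean squared errors can be compared with the last column of Table 3.2.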
Prediction Bounds for Gaussian Processes
If the ARMA process $\{X_t\}$ is driven by Gaussian white noise (i.e., if $\{Z_t\} \sim
\mathrm{IID}\ N(0, \sigma^2)$), then for each $h \ge 1$ the prediction error $X_{n+h} - P_n X_{n+h}$ is normally
distributed with mean 0 and variance $\sigma_n^2(h)$ given by (3.3.13) (and approximated
for large $n$ by (3.3.19)).

Consequently, if $\Phi_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the standard normal dis-
tribution function, it follows that $X_{n+h}$ lies between the bounds $P_n X_{n+h} \pm \Phi_{1-\alpha/2}\, \sigma_n(h)$
with probability $(1 - \alpha)$. These bounds are therefore called $(1 - \alpha)$ prediction bounds
for $X_{n+h}$.
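
Given a predictor and its mean squared error, the Gaussian prediction bounds are a one-line computation. The sketch below uses the standard library's `statistics.NormalDist` for the normal quantile; the function name `prediction_bounds` is hypothetical, and the usage line takes the one-step predictor 1.0638 and mean squared error 1 from Example 3.3.5.

```python
from statistics import NormalDist

def prediction_bounds(pred, mse, alpha=0.05):
    """(1 - alpha) prediction bounds: pred -/+ quantile * sqrt(mse)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # (1 - alpha/2) quantile of N(0, 1)
    half = z * mse ** 0.5
    return pred - half, pred + half

print(prediction_bounds(1.0638, 1.0))
```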

Problems 


3.1. Determine which of the following ARMA processes are causal and which of
them are invertible. (In each case $\{Z_t\}$ denotes white noise.)

a. $X_t + 0.2 X_{t-1} - 0.48 X_{t-2} = Z_t$.

b. $X_t + 1.9 X_{t-1} + 0.88 X_{t-2} = Z_t + 0.2 Z_{t-1} + 0.7 Z_{t-2}$.

c. $X_t + 0.6 X_{t-1} = Z_t + 1.2 Z_{t-1}$.

d. $X_t + 1.8 X_{t-1} + 0.81 X_{t-2} = Z_t$.

e. $X_t + 1.6 X_{t-1} = Z_t - 0.4 Z_{t-1} + 0.04 Z_{t-2}$.


3.2. For those processes in Problem 3.1 that are causal, compute and graph their
ACF and PACF using the program ITSM.


3.3. For those processes in Problem 3.1 that are causal, compute the first six co-
efficients $\psi_0, \psi_1, \ldots, \psi_5$ in the causal representation $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ of
$\{X_t\}$.


3.4. Compute the ACF and PACF of the AR(2) process
\[
X_t = .8 X_{t-2} + Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).
\]


3.5. Let $\{Y_t\}$ be the ARMA plus noise time series defined by
\[
Y_t = X_t + W_t,
\]
where $\{W_t\} \sim \mathrm{WN}(0, \sigma_w^2)$, $\{X_t\}$ is the ARMA($p, q$) process satisfying
\[
\phi(B) X_t = \theta(B) Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma_z^2),
\]
and $E(W_s Z_t) = 0$ for all $s$ and $t$.

a. Show that $\{Y_t\}$ is stationary and find its autocovariance in terms of $\sigma_w^2$ and
the ACVF of $\{X_t\}$.

b. Show that the process $U_t := \phi(B) Y_t$ is $r$-correlated, where $r = \max(p, q)$
and hence, by Proposition 2.1.1, is an MA($r$) process. Conclude that $\{Y_t\}$ is
an ARMA($p, r$) process.
3.6. Show that the two MA(1) processes
\[
X_t = Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),
\]
\[
Y_t = \tilde Z_t + \frac{1}{\theta} \tilde Z_{t-1}, \qquad \{\tilde Z_t\} \sim \mathrm{WN}(0, \sigma^2 \theta^2),
\]
where $0 < |\theta| < 1$, have the same autocovariance functions.


3.7. Suppose that $\{X_t\}$ is the noninvertible MA(1) process
\[
X_t = Z_t + \theta Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),
\]
where $|\theta| > 1$. Define a new process $\{W_t\}$ as
\[
W_t = \sum_{j=0}^{\infty} (-\theta)^{-j} X_{t-j}
\]
and show that $\{W_t\} \sim \mathrm{WN}(0, \sigma_W^2)$. Express $\sigma_W^2$ in terms of $\theta$ and $\sigma^2$ and show
that $\{X_t\}$ has the invertible representation (in terms of $\{W_t\}$)
\[
X_t = W_t + \frac{1}{\theta} W_{t-1}.
\]


3.8. Let $\{X_t\}$ denote the unique stationary solution of the autoregressive equations
\[
X_t = \phi X_{t-1} + Z_t, \qquad t = 0, \pm 1, \ldots,
\]
where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| > 1$. Then $X_t$ is given by the expression
(2.2.11). Define the new sequence
\[
W_t = X_t - \frac{1}{\phi} X_{t-1},
\]
show that $\{W_t\} \sim \mathrm{WN}(0, \sigma_W^2)$, and express $\sigma_W^2$ in terms of $\sigma^2$ and $\phi$. These
calculations show that $\{X_t\}$ is the (unique stationary) solution of the causal AR
equations
\[
X_t = \frac{1}{\phi} X_{t-1} + W_t, \qquad t = 0, \pm 1, \ldots.
\]


3.9. a. Calculate the autocovariance function $\gamma(\cdot)$ of the stationary time series
\[
Y_t = \mu + Z_t + \theta_1 Z_{t-1} + \theta_{12} Z_{t-12}, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).
\]

b. Use the program ITSM to compute the sample mean and sample autoco-
variances $\hat\gamma(h)$, $0 \le h \le 20$, of $\{\nabla\nabla_{12} X_t\}$, where $\{X_t, t = 1, \ldots, 72\}$ is the
accidental deaths series DEATHS.TSM of Example 1.1.3.

c. By equating $\hat\gamma(1)$, $\hat\gamma(11)$, and $\hat\gamma(12)$ from part (b) to $\gamma(1)$, $\gamma(11)$, and $\gamma(12)$,
respectively, from part (a), find a model of the form defined in (a) to represent
$\{\nabla\nabla_{12} X_t\}$.
3.10. By matching the autocovariances and sample autocovariances at lags 0 and 1,
fit a model of the form
\[
X_t - \mu = \phi (X_{t-1} - \mu) + Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),
\]
to the data STRIKES.TSM of Example 1.1.6. Use the fitted model to compute
the best predictor of the number of strikes in 1981. Estimate the mean squared
error of your predictor and construct 95% prediction bounds for the number of
strikes in 1981 assuming that $\{Z_t\} \sim \text{iid } N(0, \sigma^2)$.


3.11. Show that the value at lag 2 of the partial ACF of the MA(1) process
\[
X_t = Z_t + \theta Z_{t-1}, \qquad t = 0, \pm 1, \ldots,
\]
where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, is
\[
\alpha(2) = -\theta^2 / \bigl(1 + \theta^2 + \theta^4\bigr).
\]


3.12. For the MA(1) process of Problem 3.11, the best linear predictor of $X_{n+1}$ based
on $X_1, \ldots, X_n$ is
\[
\hat X_{n+1} = \phi_{n1} X_n + \cdots + \phi_{nn} X_1,
\]
where $\boldsymbol{\phi}_n = (\phi_{n1}, \ldots, \phi_{nn})'$ satisfies $R_n \boldsymbol{\phi}_n = \boldsymbol{\rho}_n$ (equation (2.5.23)). By sub-
stituting the appropriate correlations into $R_n$ and $\boldsymbol{\rho}_n$ and solving the resulting
equations (starting with the last and working up), show that for $1 \le j < n$,
$\phi_{n,n-j} = (-\theta)^{-j} \bigl(1 + \theta^2 + \cdots + \theta^{2j}\bigr) \phi_{nn}$ and hence that the PACF $\alpha(n) :=
\phi_{nn} = -(-\theta)^n / \bigl(1 + \theta^2 + \cdots + \theta^{2n}\bigr)$.


3.13. The coefficients $\theta_{nj}$ and one-step mean squared errors $v_n = r_n \sigma^2$ for the general
causal ARMA(1,1) process in Example 3.3.3 can be found as follows:

a. Show that if $y_n := r_n/(r_n - 1)$, then the last of equations (3.3.9) can be
rewritten in the form
\[
y_n = \theta^{-2} y_{n-1} + 1, \qquad n \ge 1.
\]

b. Deduce that $y_n = \theta^{-2n} y_0 + \sum_{j=1}^{n} \theta^{-2(j-1)}$ and hence determine $r_n$ and $\theta_{n+1,1}$,
$n = 0, 1, 2, \ldots$.

c. Evaluate the limits as $n \to \infty$ of $r_n$ and $\theta_{n1}$ in the two cases $|\theta| < 1$ and
$|\theta| \ge 1$.


Spectral Analysis 


4.1 Spectral Densities 

4.2 The Periodogram 

4.3 Time-Invariant Linear Filters 

4.4 The Spectral Density of an ARMA Process 


This chapter can be omitted without any loss of continuity. The reader with no back- 
ground in Fourier or complex analysis should go straight to Chapter 5. The spectral 
representation of a stationary time series {X,} essentially decomposes {X,} into a 
sum of sinusoidal components with uncorrelated random coefficients. In conjunction 
with this decomposition there is a corresponding decomposition into sinusoids of the 
autocovariance function of {X;}. The spectral decomposition is thus an analogue for 
stationary processes of the more familiar Fourier representation of deterministic func- 
tions. The analysis of stationary processes by means of their spectral representation is 
often referred to as the “frequency domain analysis” of time series or “spectral analy- 
sis.” It is equivalent to “time domain” analysis based on the autocovariance function, 
but provides an alternative way of viewing the process, which for some applications 
may be more illuminating. For example, in the design of a structure subject to a 
randomly fluctuating load, it is important to be aware of the presence in the loading 
force of a large sinusoidal component with a particular frequency to ensure that this 
is not a resonant frequency of the structure. The spectral point of view is also particu- 
larly useful in the analysis of multivariate stationary processes and in the analysis of 
linear filters. In Section 4.1 we introduce the spectral density of a stationary process 
{X,}, which specifies the frequency decomposition of the autocovariance function, 
and the closely related spectral representation (or frequency decomposition) of the 
process {X,} itself. Section 4.2 deals with the periodogram, a sample-based function 
from which we obtain estimators of the spectral density. In Section 4.3 we discuss
time-invariant linear filters from a spectral point of view and in Section 4.4 we use 
the results to derive the spectral density of an arbitrary ARMA process. 


4.1 Spectral Densities 


Suppose that $\{X_t\}$ is a zero-mean stationary time series with autocovariance function
$\gamma(\cdot)$ satisfying $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$. The spectral density of $\{X_t\}$ is the function $f(\cdot)$
defined by

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda} \gamma(h), \qquad -\infty < \lambda < \infty, \tag{4.1.1}$$

where $e^{i\lambda} = \cos(\lambda) + i \sin(\lambda)$ and $i = \sqrt{-1}$. The summability of $|\gamma(\cdot)|$ implies that
the series in (4.1.1) converges absolutely (since $|e^{ih\lambda}|^2 = \cos^2(h\lambda) + \sin^2(h\lambda) = 1$).
Since cos and sin have period $2\pi$, so also does $f$, and it suffices to confine attention
to the values of $f$ on the interval $(-\pi, \pi]$.


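Basic Properties of f:
(a) $f$ is even, i.e., $f(\lambda) = f(-\lambda)$, (4.1.2)
(b) $f(\lambda) \ge 0$ for all $\lambda \in (-\pi, \pi]$, (4.1.3)
and
(c) $\gamma(k) = \int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} \cos(k\lambda) f(\lambda)\, d\lambda$. (4.1.4)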


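Proof. Since $\sin(\cdot)$ is an odd function and $\cos(\cdot)$ and $\gamma(\cdot)$ are even functions, we have

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} (\cos(h\lambda) - i \sin(h\lambda)) \gamma(h)
= \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \cos(-h\lambda) \gamma(h) + 0
= f(-\lambda).$$

For each positive integer $N$ define

$$f_N(\lambda) = \frac{1}{2\pi N} E\left( \left| \sum_{r=1}^{N} X_r e^{-ir\lambda} \right|^2 \right)
= \frac{1}{2\pi N} E\left( \sum_{r=1}^{N} X_r e^{-ir\lambda} \sum_{s=1}^{N} X_s e^{is\lambda} \right)
= \frac{1}{2\pi N} \sum_{|h|<N} (N - |h|) e^{-ih\lambda} \gamma(h),$$

where $\Gamma_N = [\gamma(i - j)]_{i,j=1}^{N}$. Clearly, the function $f_N$ is nonnegative for each $N$,
and since $f_N(\lambda) \to \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda} \gamma(h) = f(\lambda)$ as $N \to \infty$, $f$ must also be
nonnegative. This proves (4.1.3). Turning to (4.1.4),

$$\int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda
= \int_{-\pi}^{\pi} \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{i(k-h)\lambda} \gamma(h)\, d\lambda
= \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma(h) \int_{-\pi}^{\pi} e^{i(k-h)\lambda}\, d\lambda
= \gamma(k),$$

since the only nonzero summand in the second line is the one for which $h = k$ (see
Problem 4.1). ■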
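The three basic properties can be checked numerically for any ACVF with finitely many nonzero lags. The following is a minimal Python sketch, not part of the original text: the function names and the particular ACVF values are hypothetical, and the integral in (4.1.4) is approximated by a midpoint rule.

```python
import cmath
import math

def spectral_density(acvf, lam):
    """Evaluate (4.1.1) for an ACVF stored as {lag: value} with finitely many nonzero lags."""
    total = sum(g * cmath.exp(-1j * h * lam) for h, g in acvf.items())
    return total.real / (2 * math.pi)

def acvf_from_density(f, k, n_grid=2000):
    """Approximate the cosine integral in (4.1.4) with a midpoint rule on (-pi, pi]."""
    step = 2 * math.pi / n_grid
    nodes = (-math.pi + (j + 0.5) * step for j in range(n_grid))
    return step * sum(math.cos(k * lam) * f(lam) for lam in nodes)

# Hypothetical ACVF: gamma(0) = 1.25, gamma(+-1) = 0.5, zero at all other lags.
gamma = {0: 1.25, 1: 0.5, -1: 0.5}
f = lambda lam: spectral_density(gamma, lam)

print(f(0.3), f(-0.3))            # property (a): the two values agree
print(acvf_from_density(f, 0))    # property (c): recovers gamma(0)
print(acvf_from_density(f, 1))    # property (c): recovers gamma(1)
```

Because the integrand here is a trigonometric polynomial, the midpoint rule over a full period is accurate up to rounding; for a general density a finer grid or a proper quadrature routine would be needed.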


Equation (4.1.4) expresses the autocovariances of a stationary time series with
absolutely summable ACVF as the Fourier coefficients of the nonnegative even
function on $(-\pi, \pi]$ defined by (4.1.1). However, even if $\sum_{h=-\infty}^{\infty} |\gamma(h)| = \infty$, there may
exist a corresponding spectral density defined as follows.


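Definition 4.1.1

A function $f$ is the spectral density of a stationary time series $\{X_t\}$ with ACVF
$\gamma(\cdot)$ if
(i) $f(\lambda) \ge 0$ for all $\lambda \in (0, \pi]$,
(ii) $\gamma(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda$ for all integers $h$.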


Remark 1. Spectral densities are essentially unique. That is, if $f$ and $g$ are two
spectral densities corresponding to the autocovariance function $\gamma(\cdot)$, i.e.,
$\gamma(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} e^{ih\lambda} g(\lambda)\, d\lambda$ for all integers $h$, then $f$ and $g$ have the same
Fourier coefficients and hence are equal (see, for example, TSTM, Section 2.8).


The following proposition characterizes spectral densities. 


Proposition 4.1.1

A real-valued function $f$ defined on $(-\pi, \pi]$ is the spectral density of a stationary
process if and only if
(i) $f(\lambda) = f(-\lambda)$,
(ii) $f(\lambda) \ge 0$, and
(iii) $\int_{-\pi}^{\pi} f(\lambda)\, d\lambda < \infty$.
Proof. If $\gamma(\cdot)$ is absolutely summable, then (i)–(iii) follow from the basic properties of $f$,
(4.1.2)–(4.1.4). For the argument in the general case, see TSTM, Section 4.3.
Conversely, suppose $f$ satisfies (i)–(iii). Then it is easy to check, using (i), that
the function defined by

$$\gamma(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda$$

is even. Moreover, if $a_r \in \mathbb{R}$, $r = 1, \ldots, n$, then

$$\sum_{r,s=1}^{n} a_r \gamma(r - s) a_s
= \int_{-\pi}^{\pi} \sum_{r,s=1}^{n} a_r a_s e^{i\lambda(r-s)} f(\lambda)\, d\lambda
= \int_{-\pi}^{\pi} \left| \sum_{r=1}^{n} a_r e^{i\lambda r} \right|^2 f(\lambda)\, d\lambda
\ge 0,$$

so that $\gamma(\cdot)$ is also nonnegative definite and therefore, by Theorem 2.1.1, is an
autocovariance function. ■


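Corollary 4.1.1

An absolutely summable function $\gamma(\cdot)$ is the autocovariance function of a stationary
time series if and only if it is even and

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda} \gamma(h) \ge 0, \qquad \text{for all } \lambda \in (-\pi, \pi], \tag{4.1.5}$$

in which case $f(\cdot)$ is the spectral density of $\gamma(\cdot)$.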


Proof. We have already established the necessity of (4.1.5). Now suppose (4.1.5) holds.
Applying Proposition 4.1.1 (the assumptions are easily checked) we conclude that $f$
is the spectral density of some autocovariance function. But this ACVF must be $\gamma(\cdot)$,
since $\gamma(k) = \int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda$ for all integers $k$. ■


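Example 4.1.1

Using Corollary 4.1.1, it is a simple matter to show that the function defined by

$$\kappa(h) = \begin{cases} 1, & \text{if } h = 0, \\ \rho, & \text{if } h = \pm 1, \\ 0, & \text{otherwise,} \end{cases}$$

is the ACVF of a stationary time series if and only if $|\rho| \le \frac{1}{2}$ (see Example 2.1.1).
Since $\kappa(\cdot)$ is even and nonzero only at lags $0, \pm 1$, it follows from the corollary that
$\kappa$ is an ACVF if and only if the function

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda} \kappa(h) = \frac{1}{2\pi} [1 + 2\rho \cos\lambda]$$

is nonnegative for all $\lambda \in [-\pi, \pi]$. But this occurs if and only if $|\rho| \le \frac{1}{2}$, since
the minimum of $1 + 2\rho\cos\lambda$ over $\lambda$ is $1 - 2|\rho|$.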
As illustrated in the previous example, Corollary 4.1.1 provides us with a powerful 
tool for checking whether or not an absolutely summable function on the integers 
is an autocovariance function. It is much simpler and much more informative than 
direct verification of nonnegative definiteness as required in Theorem 2.1.1. 
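The check suggested by Corollary 4.1.1 is easy to mimic numerically. The sketch below (the function names and grid size are assumptions, not from the text) evaluates the candidate density for the lag-one function of Example 4.1.1 on a grid over $[-\pi, \pi]$ and tests nonnegativity, up to a small tolerance for rounding. A grid check is only a heuristic in general; here the extremes occur at $\lambda = 0$ and $\lambda = \pm\pi$, which the grid contains.

```python
import math

def kappa_density(rho, lam):
    """Candidate density (1 + 2*rho*cos(lam)) / (2*pi) for the lag-one function kappa."""
    return (1.0 + 2.0 * rho * math.cos(lam)) / (2.0 * math.pi)

def is_acvf(rho, n_grid=1001):
    """Grid check of condition (4.1.5); the grid contains -pi, 0 and pi."""
    grid = [-math.pi + 2 * math.pi * j / (n_grid - 1) for j in range(n_grid)]
    return min(kappa_density(rho, lam) for lam in grid) >= -1e-12

for rho in (0.0, 0.5, -0.5, 0.6, -0.51):
    print(rho, is_acvf(rho))
```

The values of $\rho$ with $|\rho| \le 1/2$ pass and the others fail, in line with the minimum $1 - 2|\rho|$ of $1 + 2\rho\cos\lambda$.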
Not all autocovariance functions have a spectral density. For example, the
stationary time series

$$X_t = A \cos(\omega t) + B \sin(\omega t), \tag{4.1.6}$$

where $A$ and $B$ are uncorrelated random variables with mean 0 and variance 1, has
ACVF $\gamma(h) = \cos(\omega h)$ (Problem 2.2), which is not expressible as $\int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda$,
with $f$ a function on $(-\pi, \pi]$. Nevertheless, $\gamma(\cdot)$ can be written as the Fourier
transform of the discrete distribution function

$$F(\lambda) = \begin{cases} 0 & \text{if } \lambda < -\omega, \\ 0.5 & \text{if } -\omega \le \lambda < \omega, \\ 1.0 & \text{if } \lambda \ge \omega, \end{cases}$$

i.e.,

$$\cos(\omega h) = \int_{(-\pi,\pi]} e^{ih\lambda}\, dF(\lambda),$$

where the integral is as defined in Section A.1. As the following theorem states (see
TSTM, p. 117), every ACVF is the Fourier transform of a (generalized) distribution
function on $[-\pi, \pi]$. This representation is called the spectral representation of the
ACVF.


Theorem 4.1.1 (Spectral Representation of the ACVF)

A function $\gamma(\cdot)$ defined on the integers is
the ACVF of a stationary time series if and only if there exists a right-continuous,
nondecreasing, bounded function $F$ on $[-\pi, \pi]$ with $F(-\pi) = 0$ such that

$$\gamma(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\, dF(\lambda) \tag{4.1.7}$$

for all integers $h$. (For real-valued time series, $F$ is symmetric in the sense that
$\int_{(a,b]} dF(x) = \int_{[-b,-a)} dF(x)$ for all $a$ and $b$ such that $0 < a < b$.)


Remark 2. The function $F$ is a generalized distribution function on $[-\pi, \pi]$ in
the sense that $G(\lambda) = F(\lambda)/F(\pi)$ is a probability distribution function on $[-\pi, \pi]$.
Note that since $F(\pi) = \gamma(0) = \mathrm{Var}(X_1)$, the ACF of $\{X_t\}$ has spectral representation

$$\rho(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\, dG(\lambda).$$

The function $F$ in (4.1.7) is called the spectral distribution function of $\gamma(\cdot)$. If $F(\lambda)$
can be expressed as $F(\lambda) = \int_{-\pi}^{\lambda} f(y)\, dy$ for all $\lambda \in [-\pi, \pi]$, then $f$ is the spectral
density function and the time series is said to have a continuous spectrum. If $F$ is
a discrete distribution (i.e., if $G$ is a discrete probability distribution), then the time
series is said to have a discrete spectrum. The time series (4.1.6) has a discrete
spectrum.


Example 4.1.2 Linear combination of sinusoids


Consider now the process obtained by adding uncorrelated processes of the type
defined in (4.1.6), i.e.,

$$X_t = \sum_{j=1}^{k} (A_j \cos(\omega_j t) + B_j \sin(\omega_j t)), \qquad 0 < \omega_1 < \cdots < \omega_k < \pi, \tag{4.1.8}$$

where $A_1, B_1, \ldots, A_k, B_k$ are uncorrelated random variables with $E(A_j) = E(B_j) = 0$ and
$\mathrm{Var}(A_j) = \mathrm{Var}(B_j) = \sigma_j^2$, $j = 1, \ldots, k$. By Problem 4.5, the ACVF of this time
series is $\gamma(h) = \sum_{j=1}^{k} \sigma_j^2 \cos(\omega_j h)$ and its spectral distribution function is
$F(\lambda) = \sum_{j=1}^{k} \sigma_j^2 F_j(\lambda)$, where

$$F_j(\lambda) = \begin{cases} 0 & \text{if } \lambda < -\omega_j, \\ 0.5 & \text{if } -\omega_j \le \lambda < \omega_j, \\ 1.0 & \text{if } \lambda \ge \omega_j. \end{cases}$$

A sample path of this time series with $k = 2$, $\omega_1 = \pi/4$, $\omega_2 = \pi/6$, $\sigma_1^2 = 9$, and
$\sigma_2^2 = 1$ is plotted in Figure 4.1. Not surprisingly, the sample path closely approximates
a sinusoid with frequency $\omega_1 = \pi/4$ (and period $2\pi/\omega_1 = 8$). The general features of
this sample path could have been deduced from the spectral distribution function (see
Figure 4.2), which places 90% of its total mass at the frequencies $\pm\pi/4$. This means
that 90% of the variance of $X_t$ is contributed by the term $A_1 \cos(\omega_1 t) + B_1 \sin(\omega_1 t)$,
which is a sinusoid with period 8.
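The 90% figure follows directly from the variances: $\gamma(0) = \sigma_1^2 + \sigma_2^2 = 10$, of which $\sigma_1^2 = 9$ sits at $\pm\pi/4$. The sketch below reproduces this arithmetic and draws one sample path of length 100. It is illustrative only: the text requires merely uncorrelated coefficients, so the Gaussian draws, the seed, and the function names are assumptions.

```python
import math
import random

def simulate_sinusoids(omegas, sigma2s, n, seed=0):
    """Draw one sample path X_1, ..., X_n of (4.1.8); Gaussian coefficients are an assumption."""
    rng = random.Random(seed)
    coeffs = [(rng.gauss(0.0, math.sqrt(s2)), rng.gauss(0.0, math.sqrt(s2)))
              for s2 in sigma2s]
    return [sum(a * math.cos(w * t) + b * math.sin(w * t)
                for (a, b), w in zip(coeffs, omegas))
            for t in range(1, n + 1)]

def acvf(omegas, sigma2s, h):
    """ACVF of (4.1.8): sum over j of sigma_j^2 * cos(omega_j * h)."""
    return sum(s2 * math.cos(w * h) for w, s2 in zip(omegas, sigma2s))

omegas = [math.pi / 4, math.pi / 6]
sigma2s = [9.0, 1.0]
path = simulate_sinusoids(omegas, sigma2s, 100)
share = sigma2s[0] / acvf(omegas, sigma2s, 0)   # fraction of variance at +-pi/4
print(len(path), share)
```

Any single path is itself an exact sum of two sinusoids, so the randomness only affects the amplitudes and phases, not the frequencies present.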


The remarkable feature of Example 4.1.2 is that every zero-mean stationary
process can be expressed as a superposition of uncorrelated sinusoids with frequencies
$\omega \in [0, \pi]$. In general, however, a stationary process is a superposition of infinitely
many sinusoids rather than a finite number as in (4.1.8). The required generalization
of (4.1.8) that allows for this is called a stochastic integral, written as


$$X_t = \int_{(-\pi,\pi]} e^{it\lambda}\, dZ(\lambda), \tag{4.1.9}$$

where $\{Z(\lambda), -\pi < \lambda \le \pi\}$ is a complex-valued process with orthogonal (or
uncorrelated) increments. The representation (4.1.9) of a zero-mean stationary process
$\{X_t\}$ is called the spectral representation of the process and should be compared
with the corresponding spectral representation (4.1.7) of the autocovariance function
$\gamma(\cdot)$. The underlying technical aspects of stochastic integration are beyond the scope
of this book; however, in the simple case of the process (4.1.8) it is not difficult to
see that it can be reexpressed in the form (4.1.9) by choosing

$$dZ(\lambda) = \begin{cases} \dfrac{A_j + iB_j}{2}, & \text{if } \lambda = -\omega_j \text{ and } j \in \{1, \ldots, k\}, \\[1ex] \dfrac{A_j - iB_j}{2}, & \text{if } \lambda = \omega_j \text{ and } j \in \{1, \ldots, k\}, \\[1ex] 0, & \text{otherwise.} \end{cases}$$

For this example it is also clear that

$$E\left(dZ(\lambda)\, \overline{dZ(\lambda)}\right) = \begin{cases} \dfrac{\sigma_j^2}{2}, & \text{if } \lambda = \pm\omega_j, \\[1ex] 0, & \text{otherwise.} \end{cases}$$

Figure 4-1. A sample path of size 100 from the time series in Example 4.1.2.

Figure 4-2. The spectral distribution function $F(\lambda)$, $-\pi < \lambda \le \pi$, of the time series
in Example 4.1.2 (horizontal axis: Frequency).


In general, the connection between $dZ(\lambda)$ and the spectral distribution function of
the process can be expressed symbolically as

$$E\left(dZ(\lambda)\, \overline{dZ(\lambda)}\right) = \begin{cases} F(\lambda) - F(\lambda-), & \text{for a discrete spectrum,} \\ f(\lambda)\, d\lambda, & \text{for a continuous spectrum.} \end{cases} \tag{4.1.10}$$

These relations show that a large jump in the spectral distribution function (or a
large peak in the spectral density) at frequency $\pm\omega$ indicates the presence in the time
series of strong sinusoidal components at (or near) $\omega$. The period of a sinusoid with
frequency $\omega$ radians per unit time is $2\pi/\omega$.


Example 4.1.3 White noise

If $\{X_t\} \sim \mathrm{WN}(0, \sigma^2)$, then $\gamma(0) = \sigma^2$ and $\gamma(h) = 0$ for all $|h| > 0$. This process
has a flat spectral density (see Problem 4.2)

$$f(\lambda) = \frac{\sigma^2}{2\pi}, \qquad -\pi \le \lambda \le \pi.$$

A process with this spectral density is called white noise, since each frequency in the
spectrum contributes equally to the variance of the process.


Example 4.1.4 The spectral density of an AR(1) process

If

$$X_t = \phi X_{t-1} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, then from (4.1.1), $\{X_t\}$ has spectral density

$$f(\lambda) = \frac{\sigma^2}{2\pi(1 - \phi^2)} \left( 1 + \sum_{h=1}^{\infty} \phi^h \left( e^{-ih\lambda} + e^{ih\lambda} \right) \right)
= \frac{\sigma^2}{2\pi(1 - \phi^2)} \left( 1 + \frac{\phi e^{i\lambda}}{1 - \phi e^{i\lambda}} + \frac{\phi e^{-i\lambda}}{1 - \phi e^{-i\lambda}} \right)
= \frac{\sigma^2}{2\pi} \left( 1 - 2\phi \cos\lambda + \phi^2 \right)^{-1}.$$

Graphs of $f(\lambda)$, $0 \le \lambda \le \pi$, are displayed in Figures 4.3 and 4.4 for $\phi = .7$ and
$\phi = -.7$. Observe that for $\phi = .7$ the density is large for low frequencies and small
for high frequencies. This is not unexpected, since when $\phi = .7$ the process has a
positive ACF with a large value at lag one (see Figure 4.5), making the series smooth
with relatively few high-frequency components. On the other hand, for $\phi = -.7$ the
ACF has a large negative value at lag one (see Figure 4.6), producing a series that
fluctuates rapidly about its mean value. In this case the series has a large contribution
from high-frequency components as reflected by the size of the spectral density near
frequency $\pi$.
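The closed form can be compared with a truncated version of the series in (4.1.1). The sketch below assumes the standard AR(1) autocovariances $\gamma(h) = \sigma^2 \phi^{|h|}/(1 - \phi^2)$ with $|\phi| < 1$; the function names and the truncation lag are hypothetical choices.

```python
import math

def ar1_density(phi, sigma2, lam):
    """Closed-form AR(1) spectral density sigma^2 / (2*pi*(1 - 2*phi*cos(lam) + phi^2))."""
    return sigma2 / (2 * math.pi * (1 - 2 * phi * math.cos(lam) + phi ** 2))

def ar1_density_from_acvf(phi, sigma2, lam, max_lag=200):
    """Truncate the sum in (4.1.1) at max_lag, using gamma(h) = sigma^2 phi^|h| / (1 - phi^2)."""
    gamma0 = sigma2 / (1 - phi ** 2)
    series = 1.0 + 2.0 * sum(phi ** h * math.cos(h * lam) for h in range(1, max_lag + 1))
    return gamma0 * series / (2 * math.pi)

for phi in (0.7, -0.7):
    low, high = ar1_density(phi, 1.0, 0.0), ar1_density(phi, 1.0, math.pi)
    print(phi, low, high, ar1_density_from_acvf(phi, 1.0, 1.0), ar1_density(phi, 1.0, 1.0))
```

For $\phi = .7$ the value at frequency 0 exceeds the value at $\pi$, and the ordering reverses for $\phi = -.7$, matching the description of the two graphs.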


Example 4.1.5 Spectral density of an MA(1) process


If

$$X_t = Z_t + \theta Z_{t-1},$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, then from (4.1.1),

$$f(\lambda) = \frac{\sigma^2}{2\pi} \left( 1 + \theta^2 + \theta \left( e^{-i\lambda} + e^{i\lambda} \right) \right) = \frac{\sigma^2}{2\pi} \left( 1 + 2\theta \cos\lambda + \theta^2 \right).$$

This function is shown in Figures 4.7 and 4.8 for the values $\theta = .9$ and $\theta = -.9$.
Interpretations of the graphs analogous to those in Example 4.1.4 can again be made.

Figure 4-3. The spectral density $f(\lambda)$, $0 \le \lambda \le \pi$, of $X_t = .7X_{t-1} + Z_t$, where
$\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ (horizontal axis: Frequency).

Figure 4-4. The spectral density $f(\lambda)$, $0 \le \lambda \le \pi$, of $X_t = -.7X_{t-1} + Z_t$, where
$\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$.

Figure 4-5. Plot for the process $X_t = .7X_{t-1} + Z_t$ (horizontal axis: Lag).

Figure 4-7. The spectral density $f(\lambda)$, $0 \le \lambda \le \pi$, of $X_t = Z_t + .9Z_{t-1}$, where
$\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$.


4.2 The Periodogram


If $\{X_t\}$ is a stationary time series with ACVF $\gamma(\cdot)$ and spectral density $f(\cdot)$, then
just as the sample ACVF $\hat\gamma(\cdot)$ of the observations $\{x_1, \ldots, x_n\}$ can be regarded as a
sample analogue of $\gamma(\cdot)$, so also can the periodogram $I_n(\cdot)$ of the observations be
regarded as a sample analogue of $2\pi f(\cdot)$.
To introduce the periodogram, we consider the vector of complex numbers

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{C}^n,$$

where $\mathbb{C}^n$ denotes the set of all column vectors with complex-valued components.
Now let $\omega_k = 2\pi k/n$, where $k$ is any integer between $-(n - 1)/2$ and $n/2$ (inclusive),
i.e.,

$$\omega_k = \frac{2\pi k}{n}, \qquad k = -\left[\frac{n-1}{2}\right], \ldots, \left[\frac{n}{2}\right], \tag{4.2.1}$$

where $[y]$ denotes the largest integer less than or equal to $y$. We shall refer to the set $F_n$
of these values as the Fourier frequencies associated with sample size $n$, noting that
$F_n$ is a subset of the interval $(-\pi, \pi]$. Correspondingly, we introduce the $n$ vectors


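$$\mathbf{e}_k = \frac{1}{\sqrt{n}} \begin{pmatrix} e^{i\omega_k} \\ e^{2i\omega_k} \\ \vdots \\ e^{ni\omega_k} \end{pmatrix}, \qquad k = -\left[\frac{n-1}{2}\right], \ldots, \left[\frac{n}{2}\right]. \tag{4.2.2}$$

Now $\mathbf{e}_1, \ldots, \mathbf{e}_n$ are orthonormal in the sense that

$$\mathbf{e}_j^{*} \mathbf{e}_k = \begin{cases} 1, & \text{if } j = k, \\ 0, & \text{if } j \ne k, \end{cases} \tag{4.2.3}$$

where $\mathbf{e}_j^{*}$ denotes the row vector whose $k$th component is the complex conjugate of
the $k$th component of $\mathbf{e}_j$ (see Problem 4.3). This implies that $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ is a basis
for $\mathbb{C}^n$, so that any $\mathbf{x} \in \mathbb{C}^n$ can be expressed as the sum of $n$ components,

$$\mathbf{x} = \sum_{k=-[(n-1)/2]}^{[n/2]} a_k \mathbf{e}_k. \tag{4.2.4}$$

The coefficients $a_k$ are easily found by multiplying (4.2.4) on the left by $\mathbf{e}_k^{*}$ and using
(4.2.3). Thus,

$$a_k = \mathbf{e}_k^{*} \mathbf{x} = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} x_t e^{-it\omega_k}. \tag{4.2.5}$$

The sequence $\{a_k\}$ is called the discrete Fourier transform of the sequence
$\{x_1, \ldots, x_n\}$.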


Remark 1. The $t$th component of (4.2.4) can be written as

$$x_t = \sum_{k=-[(n-1)/2]}^{[n/2]} a_k [\cos(\omega_k t) + i \sin(\omega_k t)], \qquad t = 1, \ldots, n, \tag{4.2.6}$$

showing that (4.2.4) is just a way of representing $x_t$ as a linear combination of sine
waves with frequencies $\omega_k \in F_n$.


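Definition 4.2.1

The periodogram of $\{x_1, \ldots, x_n\}$ is the function

$$I_n(\lambda) = \frac{1}{n} \left| \sum_{t=1}^{n} x_t e^{-it\lambda} \right|^2. \tag{4.2.7}$$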


Remark 2. If $\lambda$ is one of the Fourier frequencies $\omega_k$, then $I_n(\omega_k) = |a_k|^2$, and so
from (4.2.4) and (4.2.3) we find at once that the squared length of $\mathbf{x}$ is

$$\sum_{t=1}^{n} |x_t|^2 = \mathbf{x}^{*} \mathbf{x} = \sum_{k=-[(n-1)/2]}^{[n/2]} |a_k|^2 = \sum_{k=-[(n-1)/2]}^{[n/2]} I_n(\omega_k).$$

The value of the periodogram at frequency $\omega_k$ is thus the contribution to this sum of
squares from the “frequency $\omega_k$” term $a_k \mathbf{e}_k$ in (4.2.4).
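The sum-of-squares decomposition in Remark 2 can be verified directly. The following sketch evaluates the periodogram from its definition and sums it over the Fourier frequencies; the data values and function names are hypothetical, and the direct $O(n^2)$ evaluation is chosen for transparency rather than speed.

```python
import cmath
import math

def fourier_indices(n):
    """Integers k = -[(n-1)/2], ..., [n/2] indexing the Fourier frequencies 2*pi*k/n."""
    return range(-((n - 1) // 2), n // 2 + 1)

def periodogram(x, lam):
    """I_n(lam) = |sum_{t=1}^n x_t e^{-i t lam}|^2 / n."""
    n = len(x)
    s = sum(xt * cmath.exp(-1j * t * lam) for t, xt in enumerate(x, start=1))
    return abs(s) ** 2 / n

x = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.5]   # hypothetical data
n = len(x)
total = sum(periodogram(x, 2 * math.pi * k / n) for k in fourier_indices(n))
print(total, sum(v * v for v in x))   # the two sums of squares agree up to rounding
```

For long series a fast Fourier transform would normally replace the explicit sum, but the values at the Fourier frequencies are the same.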


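The next proposition shows that $I_n(\lambda)$ can be regarded as a sample analogue of
$2\pi f(\lambda)$. Recall that if $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$, then

$$2\pi f(\lambda) = \sum_{h=-\infty}^{\infty} \gamma(h) e^{-ih\lambda}, \qquad \lambda \in (-\pi, \pi]. \tag{4.2.8}$$

Proposition 4.2.1

If $x_1, \ldots, x_n$ are any real numbers and $\omega_k$ is any of the nonzero Fourier frequencies
$2\pi k/n$ in $(-\pi, \pi]$, then

$$I_n(\omega_k) = \sum_{|h|<n} \hat\gamma(h) e^{-ih\omega_k}, \tag{4.2.9}$$

where $\hat\gamma(h)$ is the sample ACVF of $x_1, \ldots, x_n$.

Proof. Since $\sum_{t=1}^{n} e^{-it\omega_k} = 0$ if $\omega_k \ne 0$, we can subtract the sample mean $\bar{x}$ from $x_t$ in the
defining equation (4.2.7) of $I_n(\omega_k)$. Hence,

$$I_n(\omega_k) = n^{-1} \sum_{s=1}^{n} \sum_{t=1}^{n} (x_s - \bar{x})(x_t - \bar{x}) e^{-i(s-t)\omega_k}
= \sum_{|h|<n} \hat\gamma(h) e^{-ih\omega_k}. \quad ■$$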

In view of the similarity between (4.2.8) and (4.2.9), a natural estimate of the
spectral density $f(\lambda)$ is $I_n(\lambda)/(2\pi)$. For a very large class of stationary time series
$\{X_t\}$ with strictly positive spectral density, it can be shown that for any fixed
frequencies $\lambda_1, \ldots, \lambda_m$ such that $0 < \lambda_1 < \cdots < \lambda_m < \pi$, the joint distribution function
$F_n(x_1, \ldots, x_m)$ of the periodogram values $(I_n(\lambda_1), \ldots, I_n(\lambda_m))$ converges, as $n \to \infty$,
to $F(x_1, \ldots, x_m)$, where

$$F(x_1, \ldots, x_m) = \begin{cases} \displaystyle\prod_{i=1}^{m} \left( 1 - \exp\left\{ \frac{-x_i}{2\pi f(\lambda_i)} \right\} \right), & \text{if } x_1, \ldots, x_m > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{4.2.10}$$

Thus for large $n$ the periodogram ordinates $(I_n(\lambda_1), \ldots, I_n(\lambda_m))$ are approximately
distributed as independent exponential random variables with means $2\pi f(\lambda_1), \ldots,
2\pi f(\lambda_m)$, respectively. In particular, for each fixed $\lambda \in (0, \pi)$ and $\epsilon > 0$,

$$P\left[ |I_n(\lambda) - 2\pi f(\lambda)| > \epsilon \right] \to p > 0, \qquad \text{as } n \to \infty,$$

so the probability of an estimation error larger than $\epsilon$ cannot be made arbitrarily
small by choosing a sufficiently large sample size $n$. Thus, $I_n(\lambda)$ is not a consistent
estimator of $2\pi f(\lambda)$.

Since for large $n$ the periodogram ordinates at fixed frequencies are approximately
independent with variances changing only slightly over small frequency intervals, we
might hope to construct a consistent estimator of $f(\lambda)$ by averaging the periodogram
estimates in a small frequency interval containing $\lambda$, provided that we can choose
the interval in such a way that its width decreases to zero while at the same time
the number of Fourier frequencies in the interval increases to $\infty$ as $n \to \infty$. This
can indeed be done, since the number of Fourier frequencies in any fixed frequency
interval increases approximately linearly with $n$. Consider, for example, the estimator

$$\tilde{f}(\lambda) = \frac{1}{2\pi} \sum_{|j| \le m} (2m + 1)^{-1} I_n(g(n, \lambda) + 2\pi j/n), \tag{4.2.11}$$

where $m = \sqrt{n}$ and $g(n, \lambda)$ is the multiple of $2\pi/n$ closest to $\lambda$. The number of
periodogram ordinates being averaged is approximately $2\sqrt{n}$, and the width of the
frequency interval over which the average is taken is approximately $4\pi/\sqrt{n}$. It can be
shown (see TSTM, Section 10.4) that this estimator is consistent for the spectral
density $f$. The argument in fact establishes the consistency of a whole class of estimators
defined as follows.


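Definition 4.2.2

A discrete spectral average estimator of the spectral density $f(\lambda)$ has the form

$$\hat{f}(\lambda) = \frac{1}{2\pi} \sum_{|j| \le m_n} W_n(j) I_n(g(n, \lambda) + 2\pi j/n), \tag{4.2.12}$$

where the bandwidths $m_n$ satisfy

$$m_n \to \infty \text{ and } m_n/n \to 0 \text{ as } n \to \infty, \tag{4.2.13}$$

and the weight functions $W_n(\cdot)$ satisfy

$$W_n(j) = W_n(-j), \quad W_n(j) \ge 0 \text{ for all } j, \tag{4.2.14}$$

$$\sum_{|j| \le m_n} W_n(j) = 1, \tag{4.2.15}$$

and

$$\sum_{|j| \le m_n} W_n^2(j) \to 0 \text{ as } n \to \infty. \tag{4.2.16}$$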

Remark 3. The conditions imposed on the sequences $\{m_n\}$ and $\{W_n(\cdot)\}$ ensure
consistency of $\hat{f}(\lambda)$ for $f(\lambda)$ for a very large class of stationary processes (see TSTM,
Theorem 10.4.1) including all the ARMA processes considered in this book. The
conditions (4.2.13) simply mean that the number of terms in the weighted average
(4.2.12) goes to $\infty$ as $n \to \infty$ while at the same time the width of the frequency
interval over which the average is taken goes to zero. The conditions on $\{W_n(\cdot)\}$
ensure that the mean and variance of $\hat{f}(\lambda)$ converge as $n \to \infty$ to $f(\lambda)$ and 0,
respectively. Under the conditions of TSTM, Theorem 10.4.1, it can be shown, in
fact, that

$$\lim_{n \to \infty} E \hat{f}(\lambda) = f(\lambda)$$

and

$$\lim_{n \to \infty} \left( \sum_{|j| \le m_n} W_n^2(j) \right)^{-1} \mathrm{Cov}(\hat{f}(\lambda), \hat{f}(\nu)) = \begin{cases} 2f^2(\lambda) & \text{if } \lambda = \nu = 0 \text{ or } \pi, \\ f^2(\lambda) & \text{if } 0 < \lambda = \nu < \pi, \\ 0 & \text{if } \lambda \ne \nu. \end{cases}$$

Example 4.2.1

For the simple moving average estimator with $m_n = \sqrt{n}$ and $W_n(j) = (2m_n + 1)^{-1}$,
$|j| \le m_n$, Remark 3 gives

$$(2m_n + 1) \mathrm{Var}(\hat{f}(\lambda)) \to \begin{cases} 2f^2(\lambda) & \text{if } \lambda = 0 \text{ or } \pi, \\ f^2(\lambda) & \text{if } 0 < \lambda < \pi. \end{cases}$$

Figure 4-9. The spectral density estimate, $I_{100}(\lambda)/(2\pi)$, $0 < \lambda \le \pi$, of the sunspot
numbers, 1770-1869.

In practice, when the sample size $n$ is a fixed finite number, the choice of $m$ and
$\{W(\cdot)\}$ involves a compromise between achieving small bias and small variance for
the estimator $\hat{f}(\lambda)$. A weight function that assigns roughly equal weights to a broad
band of frequencies will produce an estimate of $f(\lambda)$ that, although smooth, may have
a large bias, since the estimate of $f(\lambda)$ depends on the values of $I_n$ at frequencies
distant from $\lambda$. On the other hand, a weight function that assigns most of its weight to
a narrow frequency band centered at zero will give an estimator with relatively small
bias, but with a larger variance. In practice it is advisable to experiment with a range
of weight functions and to select the one that appears to strike a satisfactory balance
between bias and variance.

The option Spectrum>Smoothed Periodogram in the program ITSM allows
the user to apply up to 50 successive discrete spectral average filters with weights
$W(j) = 1/(2m + 1)$, $j = -m, -m + 1, \ldots, m$, to the periodogram. The value of $m$
for each filter can be specified arbitrarily, and the weights of the filter corresponding
to the combined effect (the convolution of the component filters) are displayed by
the program. The program computes the corresponding discrete spectral average
estimators $\hat{f}(\lambda)$, $0 \le \lambda \le \pi$.


Example 4.2.2 The sunspot numbers, 1770-1869


Figure 4.9 displays a plot of $(2\pi)^{-1}$ times the periodogram of the annual sunspot
numbers (obtained by opening the project SUNSPOTS.TSM in ITSM and
selecting Spectrum>Periodogram). Figure 4.10 shows the result of applying the discrete
spectral weights $\{\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\}$ (corresponding to $m = 1$, $W(j) = 1/(2m+1)$, $|j| \le m$). It
is obtained from ITSM by selecting Spectrum>Smoothed Periodogram, entering
1 for the number of Daniell filters, 1 for the order $m$, and clicking on Apply. As
expected, with such a small value of $m$, not much smoothing of the periodogram
occurs. If we change the number of Daniell filters to 2 and set the order of the first
filter to 1 and the order of the second filter to 2, we obtain a combined filter with a
more dispersed set of weights, $W(0) = W(1) = \frac{3}{15}$, $W(2) = \frac{2}{15}$, $W(3) = \frac{1}{15}$.
Clicking on Apply will then give the smoother spectral estimate shown in Figure 4.11.
When you are satisfied with the smoothed estimate click OK, and the dialog box will
close. All three spectral density estimates show a well-defined peak at the frequency
$\omega_{10} = 2\pi/10$ radians per year, in keeping with the suggestion from the graph of the
data itself that the sunspot series contains an approximate cycle with period around
10 or 11 years.
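The combined weights of two successive equal-weight (Daniell) filters are the convolution of the individual weight sequences. The short sketch below computes this convolution in exact rational arithmetic for orders 1 and 2; the helper names are hypothetical and the code is independent of ITSM.

```python
from fractions import Fraction

def daniell(m):
    """Weights of one equal-weight filter of order m, listed for lags -m, ..., m."""
    return [Fraction(1, 2 * m + 1)] * (2 * m + 1)

def convolve(a, b):
    """Full discrete convolution of two finite weight sequences."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

combined = convolve(daniell(1), daniell(2))   # weights for lags -3, ..., 3
print([str(w) for w in combined])
```

The result has weight 3/15 (printed in lowest terms as 1/5) at lags 0 and ±1, 2/15 at lags ±2, and 1/15 at lags ±3, and the weights sum to 1.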


4.3 Time-Invariant Linear Filters 


In Section 1.5 we saw the utility of time-invariant linear filters for smoothing the data,
estimating the trend, eliminating the seasonal and/or trend components of the data,
etc. A linear process is the output of a time-invariant linear filter (TLF) applied to a
white noise input series. More generally, we say that the process $\{Y_t\}$ is the output of
a linear filter $C = \{c_{t,k},\ t, k = 0, \pm 1, \ldots\}$ applied to an input process $\{X_t\}$ if

$$Y_t = \sum_{k=-\infty}^{\infty} c_{t,k} X_k, \qquad t = 0, \pm 1, \ldots. \tag{4.3.1}$$

The filter is said to be time-invariant if the weights $c_{t,t-k}$ are independent of $t$, i.e., if

$$c_{t,t-k} = \psi_k.$$

In this case,

$$Y_t = \sum_{k=-\infty}^{\infty} \psi_k X_{t-k}$$

and

$$Y_{t-s} = \sum_{k=-\infty}^{\infty} \psi_k X_{t-s-k},$$

so that the time-shifted process $\{Y_{t-s}, t = 0, \pm 1, \ldots\}$ is obtained from
$\{X_{t-s}, t = 0, \pm 1, \ldots\}$ by application of the same linear filter $\Psi = \{\psi_j, j = 0, \pm 1, \ldots\}$. The
TLF $\Psi$ is said to be causal if

$$\psi_j = 0 \text{ for } j < 0,$$

since then $Y_t$ is expressible in terms only of $X_s$, $s \le t$.


Example 4.3.1

The filter defined by

$$Y_t = a_t X_{-t}, \qquad t = 0, \pm 1, \ldots,$$

is linear but not time-invariant, since $c_{t,t-k} = 0$ except when $2t = k$. Thus, $c_{t,t-k}$
depends on the value of $t$.


Example 4.3.2 The simple moving average


The filter

$$Y_t = (2q + 1)^{-1} \sum_{|j| \le q} X_{t-j}$$

is a TLF with $\psi_j = (2q + 1)^{-1}$, $j = -q, \ldots, q$, and $\psi_j = 0$ otherwise.


Spectral methods are particularly valuable in describing the behavior of time- 
invariant linear filters as well as in designing filters for particular purposes such as 
the suppression of high-frequency components. The following proposition shows 
how the spectral density of the output of a TLF is related to the spectral density of 
the input—a fundamental result in the study of time-invariant linear filters. 


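Proposition 4.3.1

Let $\{X_t\}$ be a stationary time series with mean zero and spectral density $f_X(\lambda)$.
Suppose that $\Psi = \{\psi_j, j = 0, \pm 1, \ldots\}$ is an absolutely summable TLF (i.e.,
$\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$). Then the time series

$$Y_t = \sum_{j=-\infty}^{\infty} \psi_j X_{t-j}$$

is stationary with mean zero and spectral density

$$f_Y(\lambda) = \left| \Psi(e^{-i\lambda}) \right|^2 f_X(\lambda) = \Psi(e^{-i\lambda}) \Psi(e^{i\lambda}) f_X(\lambda),$$

where $\Psi(e^{-i\lambda}) = \sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda}$. (The function $\Psi(e^{-i\cdot})$ is called the transfer
function of the filter, and the squared modulus $|\Psi(e^{-i\cdot})|^2$ is referred to as the power
transfer function of the filter.)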


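Proof. Applying Proposition 2.2.1, we see that $\{Y_t\}$ is stationary with mean 0 and ACVF

$$\gamma_Y(h) = \sum_{j,k=-\infty}^{\infty} \psi_j \psi_k \gamma_X(h + k - j). \tag{4.3.2}$$

Since $\{X_t\}$ has spectral density $f_X(\lambda)$, we have

$$\gamma_X(h + k - j) = \int_{-\pi}^{\pi} e^{i(h-j+k)\lambda} f_X(\lambda)\, d\lambda, \tag{4.3.3}$$

which, by substituting (4.3.3) into (4.3.2), gives

$$\gamma_Y(h) = \sum_{j,k=-\infty}^{\infty} \psi_j \psi_k \int_{-\pi}^{\pi} e^{i(h-j+k)\lambda} f_X(\lambda)\, d\lambda
= \int_{-\pi}^{\pi} \left( \sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda} \right) \left( \sum_{k=-\infty}^{\infty} \psi_k e^{ik\lambda} \right) e^{ih\lambda} f_X(\lambda)\, d\lambda
= \int_{-\pi}^{\pi} e^{ih\lambda} \left| \sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda} \right|^2 f_X(\lambda)\, d\lambda.$$

The last expression immediately identifies the spectral density function of $\{Y_t\}$ as

$$f_Y(\lambda) = \left| \Psi(e^{-i\lambda}) \right|^2 f_X(\lambda) = \Psi(e^{-i\lambda}) \Psi(e^{i\lambda}) f_X(\lambda). \quad ■$$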

Remark 1. Proposition 4.3.1 allows us to analyze the net effect of applying one or
more filters in succession. For example, if the input process $\{X_t\}$ with spectral density
$f_X$ is operated on sequentially by two absolutely summable TLFs $\Psi_1$ and $\Psi_2$, then
the net effect is the same as that of a TLF with transfer function $\Psi_1(e^{-i\lambda}) \Psi_2(e^{-i\lambda})$
and the spectral density of the output process

$$W_t = \Psi_1(B) \Psi_2(B) X_t$$

is $\left| \Psi_1(e^{-i\lambda}) \Psi_2(e^{-i\lambda}) \right|^2 f_X(\lambda)$. (See also Remark 2 of Section 2.2.)


As we saw in Section 1.5, differencing at lag $s$ is one method for removing a
seasonal component with period $s$ from a time series. The transfer function for this
filter is $1 - e^{-is\lambda}$, which is zero for all frequencies that are integer multiples of $2\pi/s$
radians per unit time. Consequently, this filter has the desired effect of removing all
components with period $s$.
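The zeros of the lag-$s$ differencing filter are easy to confirm numerically. The sketch below evaluates the power transfer function $|1 - e^{-is\lambda}|^2$ at the multiples of $2\pi/s$ in $[0, \pi]$ for the hypothetical choice $s = 12$; the function name is an assumption.

```python
import cmath
import math

def lag_difference_power_transfer(s, lam):
    """|1 - e^{-i s lam}|^2 for the filter Y_t = X_t - X_{t-s}."""
    return abs(1 - cmath.exp(-1j * s * lam)) ** 2

s = 12
for k in range(0, s // 2 + 1):
    lam = 2 * math.pi * k / s
    print(k, lag_difference_power_transfer(s, lam))   # zero up to rounding
print(lag_difference_power_transfer(s, math.pi / s))  # maximal gain, midway between zeros
```

Midway between consecutive zeros the power transfer function reaches its maximum value 4, so components at those frequencies are amplified rather than removed.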

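The simple moving-average filter in Example 4.3.2 has transfer function

$$\Psi(e^{-i\lambda}) = D_q(\lambda),$$

where $D_q(\lambda)$ is the Dirichlet kernel

$$D_q(\lambda) = (2q + 1)^{-1} \sum_{|j| \le q} e^{-ij\lambda} = \begin{cases} \dfrac{\sin[(q + .5)\lambda]}{(2q + 1) \sin(\lambda/2)}, & \text{if } \lambda \ne 0, \\[1ex] 1, & \text{if } \lambda = 0. \end{cases}$$

A graph of $D_q$ is given in Figure 4.12. Notice that $|D_q(\lambda)|$ is near 1 in a neighborhood
of 0 and tapers off to 0 for large frequencies. This is an example of a low-pass filter.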
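The closed form of the Dirichlet kernel can be compared with the defining sum. The sketch below does this for $q = 10$ at a few frequencies; the function names and the chosen frequencies are hypothetical.

```python
import cmath
import math

def moving_average_transfer(q, lam):
    """(2q+1)^{-1} * sum_{|j|<=q} e^{-i j lam}; the imaginary parts cancel by symmetry."""
    return sum(cmath.exp(-1j * j * lam) for j in range(-q, q + 1)).real / (2 * q + 1)

def dirichlet(q, lam):
    """Closed form sin((q + 0.5) lam) / ((2q + 1) sin(lam / 2)), with value 1 at lam = 0."""
    if lam == 0:
        return 1.0
    return math.sin((q + 0.5) * lam) / ((2 * q + 1) * math.sin(lam / 2))

q = 10
for lam in (0.0, 0.1, 1.0, 2.5, math.pi):
    print(lam, moving_average_transfer(q, lam), dirichlet(q, lam))
```

The two columns agree to rounding error, and the values shrink in magnitude away from frequency 0, which is the low-pass behaviour described above.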


Figure 4-12. The transfer function $D_{10}(\lambda)$ for the simple moving-average filter
(horizontal axis: Frequency).

The ideal low-pass filter would have a transfer function of the form

$$\Psi(e^{-i\lambda}) = \begin{cases} 1, & \text{if } |\lambda| \le \omega_c, \\ 0, & \text{if } |\lambda| > \omega_c, \end{cases}$$

where $\omega_c$ is a predetermined cutoff value. To determine the corresponding linear filter,
we expand $\Psi(e^{-i\lambda})$ as a Fourier series,

$$\Psi(e^{-i\lambda}) = \sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda}, \tag{4.3.4}$$

with coefficients

$$\psi_j = \frac{1}{2\pi} \int_{-\omega_c}^{\omega_c} e^{ij\lambda}\, d\lambda = \begin{cases} \dfrac{\omega_c}{\pi}, & \text{if } j = 0, \\[1ex] \dfrac{\sin(j\omega_c)}{j\pi}, & \text{if } |j| > 0. \end{cases}$$

We can approximate the ideal low-pass filter by truncating the series in (4.3.4) at some
large value $q$, which may depend on the length of the observed input series. In Figure
4.13 the transfer function of the ideal low-pass filter with $\omega_c = \pi/4$ is plotted with the
approximations $\Psi^{(q)}(e^{-i\lambda}) = \sum_{j=-q}^{q} \psi_j e^{-ij\lambda}$ for $q = 2$ and $q = 10$. As can be seen in
the figure, the approximations do not mirror $\Psi$ very well near the cutoff value $\omega_c$ and
behave like damped sinusoids for frequencies greater than $\omega_c$. The poor approximation
in the neighborhood of $\omega_c$ is typical of Fourier series approximations to functions with
discontinuities, an effect known as the Gibbs phenomenon. Convergence factors may
be employed to help mitigate the overshoot problem at $\omega_c$ and to improve the overall
approximation of $\Psi^{(q)}(e^{-i\cdot})$ to $\Psi(e^{-i\cdot})$ (see Bloomfield, 2000).

Figure 4-13. The transfer function for the ideal low-pass filter and truncated
Fourier approximations $\Psi^{(q)}$ for $q = 2, 10$ (horizontal axis: Frequency).
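The truncated approximations are simple to compute from the coefficient formula. Because $\psi_j = \psi_{-j}$, the truncated transfer function is real and equals $\sum_{|j| \le q} \psi_j \cos(j\lambda)$. The sketch below evaluates it for $\omega_c = \pi/4$ and $q = 2, 10$ at three frequencies; the function names and evaluation points are hypothetical.

```python
import math

def lowpass_coeff(j, wc):
    """Fourier coefficient psi_j of the ideal low-pass filter with cutoff wc."""
    return wc / math.pi if j == 0 else math.sin(j * wc) / (j * math.pi)

def truncated_transfer(q, wc, lam):
    """Sum over |j| <= q of psi_j e^{-i j lam}; real because psi_j = psi_{-j}."""
    return sum(lowpass_coeff(j, wc) * math.cos(j * lam) for j in range(-q, q + 1))

wc = math.pi / 4
for q in (2, 10):
    print(q, [round(truncated_transfer(q, wc, lam), 3) for lam in (0.0, wc, math.pi)])
```

As $q$ grows, the values approach 1 inside the pass band and 0 outside it, while at the cutoff itself the series converges to the midpoint 1/2 of the jump.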


4.4 The Spectral Density of an ARMA Process 


In Section 4.1 the spectral density was computed for an MA(1) and for an AR(1)
process. As an application of Proposition 4.3.1, we can now easily derive the spectral
density of an arbitrary ARMA($p$, $q$) process.


Spectral Density of an ARMA($p$, $q$) Process:
If $\{X_t\}$ is a causal ARMA($p$, $q$) process satisfying $\phi(B) X_t = \theta(B) Z_t$, then

$$f_X(\lambda) = \frac{\sigma^2}{2\pi} \frac{\left| \theta(e^{-i\lambda}) \right|^2}{\left| \phi(e^{-i\lambda}) \right|^2}, \qquad -\pi \le \lambda \le \pi. \tag{4.4.1}$$

Because the spectral density of an ARMA process is a ratio of trigonometric
polynomials, it is often called a rational spectral density.


Proof. From (3.1.3), $\{X_t\}$ is obtained from $\{Z_t\}$ by application of the TLF with transfer
function

$$\Psi(e^{-i\lambda}) = \frac{\theta(e^{-i\lambda})}{\phi(e^{-i\lambda})}.$$

Since $\{Z_t\}$ has spectral density $f_Z(\lambda) = \sigma^2/(2\pi)$, the result now follows from
Proposition 4.3.1. ■


Figure 4-14. The spectral density $f_X(\lambda)$, $0 \le \lambda \le \pi$, of the AR(2) model (3.2.20) fitted
to the mean-corrected sunspot series.


For any specified values of the parameters $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ and $\sigma^2$, the
Spectrum>Model option of ITSM can be used to plot the model spectral density.
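Outside ITSM, formula (4.4.1) can be evaluated in a few lines. The sketch below assumes the polynomial conventions $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$; the function name and parameter values are hypothetical, and the code does not check causality.

```python
import cmath
import math

def arma_density(phi, theta, sigma2, lam):
    """Evaluate (4.4.1) with phi(z) = 1 - phi_1 z - ... and theta(z) = 1 + theta_1 z + ...."""
    z = cmath.exp(-1j * lam)
    ar = 1 - sum(p * z ** (j + 1) for j, p in enumerate(phi))
    ma = 1 + sum(t * z ** (j + 1) for j, t in enumerate(theta))
    return sigma2 * abs(ma) ** 2 / (2 * math.pi * abs(ar) ** 2)

# Hypothetical ARMA(1,1) parameters, used only to illustrate the call.
phi, theta, sigma2, lam = 0.5, 0.4, 1.0, 1.2
closed_form = sigma2 * (1 + theta ** 2 + 2 * theta * math.cos(lam)) / (
    2 * math.pi * (1 + phi ** 2 - 2 * phi * math.cos(lam)))
print(arma_density([phi], [theta], sigma2, lam), closed_form)
```

Passing empty coefficient lists gives the flat white noise density $\sigma^2/(2\pi)$, which is a convenient sanity check.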


Example 4.4.1 The spectral density of an AR(2) process


For an AR(2) process (4.4.1) becomes

$$f_X(\lambda) = \frac{\sigma^2}{2\pi \left( 1 - \phi_1 e^{-i\lambda} - \phi_2 e^{-2i\lambda} \right) \left( 1 - \phi_1 e^{i\lambda} - \phi_2 e^{2i\lambda} \right)}
= \frac{\sigma^2}{2\pi \left( 1 + \phi_1^2 + 2\phi_2 + \phi_2^2 + 2(\phi_1 \phi_2 - \phi_1) \cos\lambda - 4\phi_2 \cos^2\lambda \right)}.$$

Figure 4.14 shows the spectral density, found from the Spectrum>Model option of
ITSM, for the model (3.2.20) fitted to the mean-corrected sunspot series. Notice the
well-defined peak in the model spectral density. The frequency at which this peak
occurs can be found by differentiating the denominator of the spectral density with
respect to $\cos\lambda$ and setting the derivative equal to zero. This gives

$$\cos\lambda = \frac{\phi_1 \phi_2 - \phi_1}{4\phi_2} = 0.849.$$

The corresponding frequency is $\lambda = 0.556$ radians per year, or equivalently
$c = \lambda/(2\pi) = 0.0885$ cycles per year, and the corresponding period is therefore
$1/0.0885 = 11.3$ years. The model thus reflects the approximate cyclic behavior of
the data already pointed out in Example 4.2.2. The model spectral density in Figure
4.14 should be compared with the rescaled periodogram of the data and the
nonparametric spectral density estimates of Figures 4.9-4.11.
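The peak calculation can be reproduced with a few lines of arithmetic. The coefficients of model (3.2.20) are not stated in this section, so the values $\phi_1 = 1.318$ and $\phi_2 = -0.634$ used below are assumptions, chosen because they reproduce $\cos\lambda = 0.849$; the function name is also hypothetical.

```python
import math

def ar2_peak(phi1, phi2):
    """Stationary point of the AR(2) density in (0, pi): cos(lam) = phi1*(phi2 - 1)/(4*phi2)."""
    c = phi1 * (phi2 - 1) / (4 * phi2)      # value of cos(lambda) at the stationary point
    if abs(c) > 1:
        raise ValueError("no stationary point in (0, pi)")
    lam = math.acos(c)
    return c, lam, 2 * math.pi / lam        # cos(lambda), frequency, period

# Assumed coefficients (not stated in this section), chosen to reproduce cos(lambda) = 0.849.
c, lam, period = ar2_peak(1.318, -0.634)
print(round(c, 3), round(lam, 3), round(period, 1))
```

When $\phi_2 < 0$ the denominator of the density is a convex quadratic in $\cos\lambda$, so the stationary point minimizes the denominator and therefore marks a peak of the density.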


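Example 4.4.2 The ARMA(1,1) process


In this case the expression (4.4.1) becomes

$$f_X(\lambda) = \frac{\sigma^2 (1 + \theta e^{i\lambda})(1 + \theta e^{-i\lambda})}{2\pi (1 - \phi e^{i\lambda})(1 - \phi e^{-i\lambda})}
= \frac{\sigma^2 (1 + \theta^2 + 2\theta \cos\lambda)}{2\pi (1 + \phi^2 - 2\phi \cos\lambda)}.$$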

Rational Spectral Density Estimation 

An alternative to the spectral density estimator of Definition 4.2.2 is the estimator 
obtained by fitting an ARMA model to the data and then computing the spectral 
density of the fitted model. The spectral density shown in Figure 4.14 can be regarded 
as such an estimate, obtained by fitting an AR(2) model to the mean-corrected sunspot 
data. 

Provided that there is an ARMA model that fits the data satisfactorily, this proce- 
dure has the advantage that it can be made systematic by selecting the model according 
(for example) to the AICC criterion (see Section 5.5.2). For further information see 
TSTM, Section 10.6. 


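Problems


4.1. Show that

$$\int_{-\pi}^{\pi} e^{i(k-h)\lambda}\, d\lambda = \begin{cases} 2\pi, & \text{if } k = h, \\ 0, & \text{otherwise.} \end{cases}$$

4.2. If $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, apply Corollary 4.1.1 to compute the spectral density of
$\{Z_t\}$.

4.3. Show that the vectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ are orthonormal in the sense of (4.2.3).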


4.4. Use Corollary 4.1.1 to establish whether or not the following function is the
autocovariance function of a stationary process $\{X_t\}$:

$$\gamma(h) = \begin{cases} 1 & \text{if } h = 0, \\ -0.5 & \text{if } h = \pm 2, \\ -0.25 & \text{if } h = \pm 3, \\ 0 & \text{otherwise.} \end{cases}$$

4.5. If $\{X_t\}$ and $\{Y_t\}$ are uncorrelated stationary processes with autocovariance
functions $\gamma_X(\cdot)$ and $\gamma_Y(\cdot)$ and spectral distribution functions $F_X(\cdot)$ and $F_Y(\cdot)$,
respectively, show that the process $\{Z_t = X_t + Y_t\}$ is stationary with autocovariance
function $\gamma_Z = \gamma_X + \gamma_Y$ and spectral distribution function $F_Z = F_X + F_Y$.

4.6. Let $\{X_t\}$ be the process defined by

$$X_t = A \cos(\pi t/3) + B \sin(\pi t/3) + Y_t,$$

where $Y_t = Z_t + 2.5 Z_{t-1}$, $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, $A$ and $B$ are uncorrelated with
mean 0 and variance $\nu^2$, and $Z_t$ is uncorrelated with $A$ and $B$ for each $t$. Find
the autocovariance function and spectral distribution function of $\{X_t\}$.

4.7. Let $\{X_t\}$ denote the sunspot series filed as SUNSPOTS.TSM and let $\{Y_t\}$ denote
the mean-corrected series $Y_t = X_t - 46.93$, $t = 1, \ldots, 100$. Use ITSM to find
the Yule-Walker AR(2) model

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

i.e., find $\phi_1$, $\phi_2$, and $\sigma^2$. Use ITSM to plot the spectral density of the fitted
model and find the frequency at which it achieves its maximum value. What is
the corresponding period?

4.8. a. Use ITSM to compute and plot the spectral density of the stationary series
$\{X_t\}$ satisfying

$$X_t - 0.99 X_{t-3} = Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, 1).$$

b. Does the spectral density suggest that the sample paths of $\{X_t\}$ will exhibit
approximately oscillatory behavior? If so, then with what period?

c. Use ITSM to simulate a realization of X1, ..., Xo and plot the realization. 
Does the graph of the realization support the conclusion of part (b)? Save the 
generated series as X.TSM by clicking on the window displaying the graph, 
then on the red EXP button near the top of the screen. Select Time Series 
and File in the resulting dialog box and click OK. You will then be asked to 
provide the file name, X.TSM. 


d. Compute the spectral density of the filtered process 


1 
Y, = gA +X + Xi41) 


and compare the numerical values of the spectral densities of {X,} and {Y;} 
at frequency w = 27/3 radians per unit time. What effect would you expect 
the filter to have on the oscillations of {X,}? 


e. Open the project X.TSM and use the option Smooth>Moving Ave. to apply 
the filter of part (d) to the realization generated in part (c). Comment on the 
result. 
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4.9. 


4.10. 


The spectral density of a real-valued time series {X,} is defined on [0, x] by 


100, ifz/6—.01 <à < x/6+.01, 
f= 


0, otherwise, 
and on [—z, 0] by f(A) = f(-A). 
a. Evaluate the ACVF of {X;} at lags 0 and 1. 
b. Find the spectral density of the process {Y,} defined by 


Y, := VX, = X; — X12- 


c. What is the variance of Y,? 


d. Sketch the power transfer function of the filter Vj. and use the sketch to 
explain the effect of the filter on sinusoids with frequencies (i) near zero and 
(ii) near 7/6. 


Suppose that {X,} is the noncausal and noninvertible ARMA(1,1) process sat- 
isfying 
X,— X1 = Z +0Z- {Z,}~WN (0, 0°), 
where |p| > 1 and |6| > 1. Define $(B) = 1 — 4B and 6(B) = 1 + 7B and let 
{W,} be the process given by 
W, := 0-'(B)$(B)X,. 
a. Show that {W,} has a constant spectral density function. 
b. Conclude that {W,} ~ WN(0, o2). Give an explicit formula for oå in terms 
of ¢, 6, and o°. 
c. Deduce that ¢(B)X, = 0(B)W,, so that {X,} is a causal and invertible 
ARMA(1,1) process relative to the white noise sequence { W;,}. 


Modeling and Forecasting 
with ARMA Processes 


5.1 Preliminary Estimation 

5.2 Maximum Likelihood Estimation 
5.3 Diagnostic Checking 

5.4 Forecasting 

5.5 Order Selection 


The determination of an appropriate ARMA(p, q) model to represent an observed 
stationary time series involves a number of interrelated problems. These include 
the choice of p and q (order selection) and estimation of the mean, the coefficients
$\{\phi_i, i = 1, \ldots, p\}$, $\{\theta_i, i = 1, \ldots, q\}$, and the white noise variance $\sigma^2$. Final selection
of the model depends on a variety of goodness of fit tests, although it can
be systematized to a large degree by use of criteria such as minimization of the 
AICC statistic as discussed in Section 5.5. (A useful option in the program ITSM 
is Model>Estimation>Autofit, which automatically minimizes the AICC statistic 
over all ARMA(p, q) processes with p and q in a specified range.) 

This chapter is primarily devoted to the problem of estimating the parameters
$\boldsymbol{\phi} = (\phi_1, \ldots, \phi_p)'$, $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_q)'$, and $\sigma^2$ when p and q are assumed to be
known, but the crucial issue of order selection is also considered. It will be assumed
throughout (unless the mean is believed a priori to be zero) that the data have been
“mean-corrected” by subtraction of the sample mean, so that it is appropriate to fit
a zero-mean ARMA model to the adjusted data $x_1, \ldots, x_n$. If the model fitted to the
mean-corrected data is

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

then the corresponding model for the original stationary series $\{Y_t\}$ is found on
replacing $X_t$ for each t by $Y_t - \bar{y}$, where $\bar{y} = n^{-1}\sum_{j=1}^{n} y_j$ is the sample mean of the
original data, treated as a fixed constant.

When p and q are known, good estimators of $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ can be found by imagining
the data to be observations of a stationary Gaussian time series and maximizing
the likelihood with respect to the p + q + 1 parameters $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ and
$\sigma^2$. The estimators obtained by this procedure are known as maximum likelihood
(or maximum Gaussian likelihood) estimators. Maximum likelihood estimation is
discussed in Section 5.2 and can be carried out in practice using the ITSM option
Model>Estimation>Max likelihood, after first specifying a preliminary model to
initialize the maximization algorithm. Maximization of the likelihood and selection
of the minimum AICC model over a specified range of p and q values can also be
carried out using the option Model>Estimation>Autofit.

The maximization is nonlinear in the sense that the function to be maximized 
is not a quadratic function of the unknown parameters, so the estimators cannot be 
found by solving a system of linear equations. They are found instead by searching 
numerically for the maximum of the likelihood surface. The algorithm used in ITSM 
requires the specification of initial parameter values with which to begin the search. 
The closer the preliminary estimates are to the maximum likelihood estimates, the 
faster the search will generally be. 

To provide these initial values, a number of preliminary estimation algorithms 
are available in the option Model>Estimation>Preliminary of ITSM. They are 
described in Section 5.1. For pure autoregressive models the choice is between
Yule-Walker and Burg estimation, while for models with q > 0 it is between the innovations
and Hannan-Rissanen algorithms. It is also possible to begin the search with an
arbitrary causal ARMA model by using the option Model>Specify and entering the
desired parameter values. The initial values are chosen automatically in the option
Model>Estimation>Autofit. 

Calculation of the exact Gaussian likelihood for an ARMA model (and in fact for 
any second-order model) is greatly simplified by use of the innovations algorithm. In 
Section 5.2 we take advantage of this simplification in discussing maximum likelihood 
estimation and consider also the construction of confidence intervals for the estimated 
coefficients. 

Section 5.3 deals with goodness of fit tests for the chosen model and Section 
5.4 with the use of the fitted model for forecasting. In Section 5.5 we discuss the 
theoretical basis for some of the criteria used for order selection. 

For an overview of the general strategy for model-fitting see Section 6.2. 


5.1 Preliminary Estimation 


In this section we shall consider four techniques for preliminary estimation of the
parameters $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_p)'$, $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_q)'$, and $\sigma^2$ from observations $x_1, \ldots, x_n$
of the causal ARMA(p, q) process defined by

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{5.1.1}$$


The Yule-Walker and Burg procedures apply to the fitting of pure autoregressive
models. (Although the former can be adapted to models with q > 0, its performance
is less efficient than when q = 0.) The innovations and Hannan-Rissanen algorithms
are used in ITSM to provide preliminary estimates of the ARMA parameters when
q > 0.

For pure autoregressive models Burg’s algorithm usually gives higher likelihoods 
than the Yule—Walker equations. For pure moving-average models the innovations 
algorithm frequently gives slightly higher likelihoods than the Hannan—Rissanen 
algorithm (we use only the first two steps of the latter for preliminary estimation). For 
mixed models (i.e., those with p > 0 and q > 0) the Hannan-Rissanen algorithm is 
usually more successful in finding causal models (which are required for initialization 
of the likelihood maximization). 


5.1.1 Yule—Walker Estimation 


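For a pure autoregressive model the moving-average polynomial $\theta(z)$ is identically
1, and the causality assumption in (5.1.1) allows us to write $X_t$ in the form

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \tag{5.1.2}$$

where, from Section 3.1, $\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = 1/\phi(z)$. Multiplying each side of
(5.1.1) by $X_{t-j}$, $j = 0, 1, 2, \ldots, p$, taking expectations, and using (5.1.2) to evaluate
the right-hand side of the first equation, we obtain the Yule-Walker equations

$$\Gamma_p \boldsymbol{\phi} = \boldsymbol{\gamma}_p \tag{5.1.3}$$

and

$$\sigma^2 = \gamma(0) - \boldsymbol{\phi}'\boldsymbol{\gamma}_p, \tag{5.1.4}$$

where $\Gamma_p$ is the covariance matrix $[\gamma(i-j)]_{i,j=1}^{p}$ and $\boldsymbol{\gamma}_p = (\gamma(1), \ldots, \gamma(p))'$. These
equations can be used to determine $\gamma(0), \ldots, \gamma(p)$ from $\sigma^2$ and $\boldsymbol{\phi}$.

On the other hand, if we replace the covariances $\gamma(j)$, $j = 0, \ldots, p$, appearing
in (5.1.3) and (5.1.4) by the corresponding sample covariances $\hat\gamma(j)$, we obtain a set
of equations for the so-called Yule-Walker estimators $\hat{\boldsymbol{\phi}}$ and $\hat\sigma^2$ of $\boldsymbol{\phi}$ and $\sigma^2$, namely,

$$\hat\Gamma_p \hat{\boldsymbol{\phi}} = \hat{\boldsymbol{\gamma}}_p \tag{5.1.5}$$

and

$$\hat\sigma^2 = \hat\gamma(0) - \hat{\boldsymbol{\phi}}'\hat{\boldsymbol{\gamma}}_p, \tag{5.1.6}$$

where $\hat\Gamma_p = [\hat\gamma(i-j)]_{i,j=1}^{p}$ and $\hat{\boldsymbol{\gamma}}_p = (\hat\gamma(1), \ldots, \hat\gamma(p))'$.

If $\hat\gamma(0) > 0$, then $\hat\Gamma_m$ is nonsingular for every $m = 1, 2, \ldots$ (see TSTM, Problem
7.11), so we can rewrite equations (5.1.5) and (5.1.6) in the following form:


Sample Yule-Walker Equations:

$$\hat{\boldsymbol{\phi}} = (\hat\phi_1, \ldots, \hat\phi_p)' = \hat R_p^{-1}\hat{\boldsymbol{\rho}}_p \tag{5.1.7}$$

and

$$\hat\sigma^2 = \hat\gamma(0)\left[1 - \hat{\boldsymbol{\rho}}_p'\hat R_p^{-1}\hat{\boldsymbol{\rho}}_p\right], \tag{5.1.8}$$

where $\hat{\boldsymbol{\rho}}_p = (\hat\rho(1), \ldots, \hat\rho(p))' = \hat{\boldsymbol{\gamma}}_p/\hat\gamma(0)$.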
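
A minimal sketch of the sample Yule-Walker equations (5.1.7) and (5.1.8), solved directly as a linear system, is given below. The function names `sample_acvf` and `yule_walker` are illustrative, the divisor n in the sample autocovariance is the usual convention and is assumed here, and the simulated series in the usage lines is hypothetical data rather than one of the book's data sets.

```python
import numpy as np

def sample_acvf(x, max_lag):
    """Sample autocovariances gamma_hat(0..max_lag), divisor n, mean-corrected."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    return np.array([np.dot(d[h:], d[:n - h]) / n for h in range(max_lag + 1)])

def yule_walker(gamma_hat, p):
    """Solve (5.1.7)-(5.1.8): phi_hat = R_p^{-1} rho_p, sigma2_hat = gamma(0)[1 - rho_p' phi_hat]."""
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    rho = gamma_hat / gamma_hat[0]
    # R_p[i, j] = rho(|i - j|), a symmetric Toeplitz matrix
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    R = rho[lags]
    rho_p = rho[1:p + 1]
    phi_hat = np.linalg.solve(R, rho_p)
    sigma2_hat = gamma_hat[0] * (1.0 - rho_p @ phi_hat)
    return phi_hat, sigma2_hat

# Hypothetical usage on a simulated AR(1) series (seed chosen arbitrarily).
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.6 * x[t - 1] + z[t]
print(yule_walker(sample_acvf(x, 2), 2))
```

Solving the p-by-p system directly costs more than the recursive scheme used when many orders are fitted in succession, but it mirrors (5.1.7) line for line.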


With $\hat{\boldsymbol{\phi}}$ as defined by (5.1.7), it can be shown that $1 - \hat\phi_1 z - \cdots - \hat\phi_p z^p \neq 0$ for
$|z| \le 1$ (see TSTM, Problem 8.3). Hence the fitted model

$$X_t - \hat\phi_1 X_{t-1} - \cdots - \hat\phi_p X_{t-p} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat\sigma^2)$$

is causal. The autocovariances $\gamma_F(h)$, $h = 0, \ldots, p$, of the fitted model therefore
satisfy the p + 1 linear equations

$$\gamma_F(h) - \hat\phi_1\gamma_F(h-1) - \cdots - \hat\phi_p\gamma_F(h-p) = \begin{cases} 0, & h = 1, \ldots, p, \\ \hat\sigma^2, & h = 0. \end{cases}$$

However, from (5.1.5) and (5.1.6) we see that the solution of these equations is
$\gamma_F(h) = \hat\gamma(h)$, $h = 0, \ldots, p$, so that the autocovariances of the fitted model at lags
0, 1, ..., p coincide with the corresponding sample autocovariances.

The argument of the preceding paragraph shows that for every nonsingular covariance
matrix of the form $\Gamma_{p+1} = [\gamma(i-j)]_{i,j=1}^{p+1}$ there is an AR(p) process whose
autocovariances at lags 0, ..., p are $\gamma(0), \ldots, \gamma(p)$. (The required coefficients and white
noise variance are found from (5.1.7) and (5.1.8) on replacing $\hat\rho(j)$ by $\gamma(j)/\gamma(0)$,
$j = 0, \ldots, p$, and $\hat\gamma(0)$ by $\gamma(0)$.) There may not, however, be an MA(p) process with
this property. For example, if $\gamma(0) = 1$ and $\gamma(1) = \gamma(-1) = \beta$, the matrix $\Gamma_2$ is a
nonsingular covariance matrix for all $\beta \in (-1, 1)$. Consequently, there is an AR(1)
process with autocovariances 1 and $\beta$ at lags 0 and 1 for all $\beta \in (-1, 1)$. However,
there is an MA(1) process with autocovariances 1 and $\beta$ at lags 0 and 1 if and only if
$|\beta| \le \frac{1}{2}$. (See Example 2.1.1.)

It is often the case that moment estimators, i.e., estimators that (like $\hat{\boldsymbol{\phi}}$) are
obtained by equating theoretical and sample moments, have much higher variances than
estimators obtained by alternative methods such as maximum likelihood. However,
the Yule-Walker estimators of the coefficients $\phi_1, \ldots, \phi_p$ of an AR(p) process have
approximately the same distribution for large samples as the corresponding maximum
likelihood estimators. For a precise statement of this result see TSTM, Section
8.10. For our purposes it suffices to note the following:


Large-Sample Distribution of Yule-Walker Estimators:

For a large sample from an AR(p) process,

$$\hat{\boldsymbol{\phi}} \approx N\left(\boldsymbol{\phi},\, n^{-1}\sigma^2\Gamma_p^{-1}\right).$$

If we replace $\sigma^2$ and $\Gamma_p$ by their estimates $\hat\sigma^2$ and $\hat\Gamma_p$, we can use this result to
find large-sample confidence regions for $\boldsymbol{\phi}$ and each of its components as in (5.1.12)
and (5.1.13) below.


Order Selection 

In practice we do not know the true order of the model generating the data. In fact, 
it will usually be the case that there is no true AR model, in which case our goal 
is simply to find one that represents the data optimally in some sense. Two useful 
techniques for selecting an appropriate AR model are given below. The second is 
more systematic and extends beyond the narrow class of pure autoregressive models. 


• Some guidance in the choice of order is provided by a large-sample result (see
TSTM, Section 8.10), which states that if $\{X_t\}$ is the causal AR(p) process defined
by (5.1.1) with $\{Z_t\} \sim \mathrm{iid}(0, \sigma^2)$ and if we fit a model with order m > p using
the Yule-Walker equations, i.e., if we fit a model with coefficient vector

$$\hat{\boldsymbol{\phi}}_m = \hat R_m^{-1}\hat{\boldsymbol{\rho}}_m, \quad m > p,$$

then the last component, $\hat\phi_{mm}$, of the vector $\hat{\boldsymbol{\phi}}_m$ is approximately normally
distributed with mean 0 and variance 1/n. Notice that $\hat\phi_{mm}$ is exactly the sample
partial autocorrelation at lag m as defined in Section 3.2.3.

Now, we already know from Example 3.2.6 that for an AR(p) process the partial
autocorrelations $\phi_{mm}$, m > p, are zero. By the result of the previous paragraph,
if an AR(p) model is appropriate for the data, then the values $\hat\phi_{kk}$, k > p, should
be compatible with observations from the distribution N(0, 1/n). In particular,
for k > p, $\hat\phi_{kk}$ will fall between the bounds $\pm 1.96n^{-1/2}$ with probability close to
0.95. This suggests using as a preliminary estimator of p the smallest value m
such that $|\hat\phi_{kk}| < 1.96n^{-1/2}$ for k > m.

The program ITSM plots the sample PACF $\{\hat\phi_{mm}, m = 1, 2, \ldots\}$ together with
the bounds $\pm 1.96/\sqrt{n}$. From this graph it is easy to read off the preliminary
estimator of p defined above.


• A more systematic approach to order selection is to find the values of p and $\boldsymbol{\phi}_p$
that minimize the AICC statistic (see Section 5.5.2 below)

$$\mathrm{AICC} = -2\ln L(\boldsymbol{\phi}_p, S(\boldsymbol{\phi}_p)/n) + 2(p+1)n/(n-p-2),$$

where L is the Gaussian likelihood defined in (5.2.9) and S is defined in (5.2.11).
The Preliminary Estimation dialog box of ITSM (opened by pressing the
blue PRE button) allows you to search for the minimum AICC Yule-Walker (or
Burg) models by checking Find AR model with min AICC. This causes the
program to fit autoregressions of orders 0, 1, ..., 27 and to return the model with
smallest AICC value.


Definition 5.1.1.

The fitted Yule-Walker AR(m) model is

$$X_t - \hat\phi_{m1}X_{t-1} - \cdots - \hat\phi_{mm}X_{t-m} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat v_m), \tag{5.1.9}$$

where

$$\hat{\boldsymbol{\phi}}_m = (\hat\phi_{m1}, \ldots, \hat\phi_{mm})' = \hat R_m^{-1}\hat{\boldsymbol{\rho}}_m \tag{5.1.10}$$

and

$$\hat v_m = \hat\gamma(0)\left[1 - \hat{\boldsymbol{\rho}}_m'\hat R_m^{-1}\hat{\boldsymbol{\rho}}_m\right]. \tag{5.1.11}$$


For both approaches to order selection we need to fit AR models of gradually 
increasing order to our given data. The problem of solving the Yule-Walker equations 
with gradually increasing orders has already been encountered in a slightly different 
context in Section 2.5.1, where we derived a recursive scheme for solving the equa- 
tions (5.1.3) and (5.1.4) with p successively taking the values 1, 2,.... Here we can 
use exactly the same scheme (the Durbin—Levinson algorithm) to solve the Yule— 
Walker equations (5.1.5) and (5.1.6), the only difference being that the covariances 
in (5.1.3) and (5.1.4) are replaced by their sample counterparts. This is the algorithm 
used by ITSM to perform the necessary calculations. 
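
The recursive fitting described above can be sketched as follows. This is an illustrative implementation of the Durbin-Levinson recursion applied to sample autocovariances, not the ITSM code; the function name `durbin_levinson` is an assumption. The input values are the three sample autocovariances quoted in Example 5.1.1.

```python
def durbin_levinson(gamma, max_order):
    """Return [(phi_m, v_m)] for m = 1..max_order, with phi_m = [phi_m1, ..., phi_mm].

    gamma[h] is the (sample) autocovariance at lag h, h = 0..max_order.
    """
    phi, v, fits = [], gamma[0], []
    for m in range(1, max_order + 1):
        # phi_mm = [gamma(m) - sum_{j=1}^{m-1} phi_{m-1,j} gamma(m-j)] / v_{m-1}
        num = gamma[m] - sum(phi[j] * gamma[m - 1 - j] for j in range(m - 1))
        phi_mm = num / v
        # phi_mj = phi_{m-1,j} - phi_mm * phi_{m-1,m-j}, j = 1..m-1
        phi = [phi[j] - phi_mm * phi[m - 2 - j] for j in range(m - 1)] + [phi_mm]
        # v_m = v_{m-1} (1 - phi_mm^2)
        v *= 1.0 - phi_mm ** 2
        fits.append((phi, v))
    return fits

# Sample autocovariances of the differenced Dow Jones series (Example 5.1.1).
for m, (phi, v) in enumerate(durbin_levinson([0.17992, 0.07590, 0.04885], 2), start=1):
    print(m, [round(c, 4) for c in phi], round(v, 4))
```

The last entry of each coefficient list is the sample partial autocorrelation at that lag, so the same loop also supplies the values compared with the bounds $\pm 1.96/\sqrt{n}$ in the first order-selection rule. Because the inputs are rounded to five decimals, the output can differ from the figures printed in the example in the fourth decimal place.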


Confidence Regions for the Coefficients 

Under the assumption that the order p of the fitted model is the correct value, we can
use the asymptotic distribution of $\hat{\boldsymbol{\phi}}_p$ to derive approximate large-sample confidence
regions for the true coefficient vector $\boldsymbol{\phi}_p$ and for its individual components $\phi_{pj}$. Thus,
if $\chi^2_{1-\alpha}(p)$ denotes the $(1-\alpha)$ quantile of the chi-squared distribution with p degrees
of freedom, then for large sample-size n the region

$$\left\{\boldsymbol{\phi} \in \mathbb{R}^p : (\hat{\boldsymbol{\phi}}_p - \boldsymbol{\phi})'\,\hat\Gamma_p\,(\hat{\boldsymbol{\phi}}_p - \boldsymbol{\phi}) \le n^{-1}\hat v_p\,\chi^2_{1-\alpha}(p)\right\} \tag{5.1.12}$$

contains $\boldsymbol{\phi}_p$ with probability close to $(1-\alpha)$. (This follows from Problem A.7 and
the fact that $\sqrt{n}(\hat{\boldsymbol{\phi}}_p - \boldsymbol{\phi}_p)$ is approximately normally distributed with mean 0 and
covariance matrix $\hat v_p\hat\Gamma_p^{-1}$.) Similarly, if $\Phi_{1-\alpha}$ denotes the $(1-\alpha)$ quantile of the
standard normal distribution and $\hat v_{jj}$ is the jth diagonal element of $\hat v_p\hat\Gamma_p^{-1}$, then for
large n the interval bounded by

$$\hat\phi_{pj} \pm \Phi_{1-\alpha/2}\,n^{-1/2}\,\hat v_{jj}^{1/2} \tag{5.1.13}$$

contains $\phi_{pj}$ with probability close to $(1-\alpha)$.
Example 5.1.1 The Dow Jones Utilities Index, Aug. 28-Dec. 18, 1972; DOWJ.TSM


The very slowly decaying positive sample ACF of the time series contained in the
file DOWJ.TSM suggests differencing at lag 1 before attempting to
fit a stationary model. One application of the operator (1 - B) produces a new series
$\{Y_t\}$ with no obvious deviations from stationarity. We shall therefore try fitting an AR
process to this new series

$$Y_t = D_t - D_{t-1}$$

using the Yule-Walker equations. There are 77 values of $Y_t$, which we shall denote
by $Y_1, \ldots, Y_{77}$. (We ignore the unequal spacing of the original data resulting from
the five-day working week.) The sample autocovariances of the series $y_1, \ldots, y_{77}$ are
$\hat\gamma(0) = 0.17992$, $\hat\gamma(1) = 0.07590$, $\hat\gamma(2) = 0.04885$, etc.

Applying the Durbin-Levinson algorithm to fit successively higher-order
autoregressive processes to the data, we obtain

$$\hat\phi_{11} = \hat\rho(1) = 0.4219,$$
$$\hat v_1 = \hat\gamma(0)\left[1 - \hat\rho^2(1)\right] = 0.1479,$$
$$\hat\phi_{22} = \left[\hat\gamma(2) - \hat\phi_{11}\hat\gamma(1)\right]/\hat v_1 = 0.1138,$$
$$\hat\phi_{21} = \hat\phi_{11} - \hat\phi_{11}\hat\phi_{22} = 0.3739,$$
$$\hat v_2 = \hat v_1\left[1 - \hat\phi_{22}^2\right] = 0.1460.$$


The sample ACF and PACF of the data can be displayed by pressing the second
yellow button at the top of the ITSM window. They are shown in Figures 5.1 and 5.2,
respectively. Also plotted are the bounds $\pm 1.96/\sqrt{77}$. Since the PACF values at lags
greater than 1 all lie between the bounds, the first order-selection criterion described
above indicates that we should fit an AR(1) model to the data set $\{Y_t\}$. Unless we wish
to assume that $\{Y_t\}$ is a zero-mean process, we should subtract the sample mean from
the data before attempting to fit a (zero-mean) AR(1) model. When the blue PRE
(preliminary estimation) button at the top of the ITSM window is pressed, you will
be given the option of subtracting the mean from the data. In this case (as in most)
click Yes to obtain the new series

$$X_t = Y_t - 0.1336.$$


You will then see the Preliminary Estimation dialog box. Enter 1 for the AR
order, zero for the MA order, select Yule-Walker, and click OK. We have already
computed $\hat\phi_{11}$ and $\hat v_1$ above using the Durbin-Levinson algorithm. The Yule-Walker
AR(1) model obtained by ITSM for $\{X_t\}$ is therefore (not surprisingly)

$$X_t - 0.4219X_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1479), \tag{5.1.14}$$

and the corresponding model for $\{Y_t\}$ is

$$Y_t - 0.1336 - 0.4219(Y_{t-1} - 0.1336) = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1479). \tag{5.1.15}$$

Figure 5-1. The sample ACF of the differenced series $\{Y_t\}$ in Example 5.1.1.


Assuming that our observed data really are generated by an AR process with
p = 1, (5.1.13) gives us approximate 95% confidence bounds for the autoregressive
coefficient $\phi$,

$$0.4219 \pm \frac{(1.96)(.1479)^{1/2}}{(.17992)^{1/2}\sqrt{77}} = (0.2194, 0.6244).$$

Besides estimating the autoregressive coefficients, ITSM computes and prints out
the ratio of each coefficient to 1.96 times its estimated standard deviation. From these
numbers large-sample 95% confidence intervals for each of the coefficients are easily
obtained. In this particular example there is just one coefficient estimate, $\hat\phi_1 = 0.4219$,
with ratio of coefficient to 1.96 × standard error equal to 2.0832. Hence the required
95% confidence bounds are 0.4219 ± 0.4219/2.0832 = (0.2194, 0.6244), as found
above.
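
The two routes to the confidence bounds can be checked with a few lines of arithmetic. The sketch below is illustrative only; both helper names are assumptions. The first helper applies (5.1.13) with the jth diagonal element of $\hat v_p\hat\Gamma_p^{-1}$ supplied by the caller (for an AR(1) fit this is $\hat v_1/\hat\gamma(0)$), and the second recovers the half-width from the ratio printed by ITSM.

```python
import math

def ar_coefficient_bounds(phi_hat, v_jj, n, z=1.96):
    """Bounds phi_hat +/- z * n^{-1/2} * v_jj^{1/2}, as in (5.1.13)."""
    half = z * math.sqrt(v_jj / n)
    return phi_hat - half, phi_hat + half

def bounds_from_itsm_ratio(phi_hat, ratio):
    """Bounds from the ratio (coefficient) / (1.96 * standard error)."""
    half = abs(phi_hat) / abs(ratio)
    return phi_hat - half, phi_hat + half

# Example 5.1.1: phi_hat = 0.4219, v_1 = 0.1479, gamma_hat(0) = 0.17992, n = 77.
print(ar_coefficient_bounds(0.4219, 0.1479 / 0.17992, 77))
print(bounds_from_itsm_ratio(0.4219, 2.0832))
```

Both calls give bounds that agree with (0.2194, 0.6244) to four decimal places.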

A useful technique for preliminary autoregressive estimation that incorporates
automatic model selection (i.e., choice of p) is to minimize the AICC (see equation
(5.5.4)) over all fitted autoregressions of orders 0 through 27. This is achieved by
selecting both Yule-Walker and Find AR model with min AICC in the Preliminary
Estimation dialog box. (The MA order must be set to zero, but the AR
order setting is immaterial.) Click OK, and the program will search through all the
Yule-Walker AR(p) models, p = 0, 1, ..., 27, selecting the one with smallest AICC
value. The minimum-AICC Yule-Walker AR model turns out to be the one defined
by (5.1.14) with p = 1 and AICC value 74.541.

Figure 5-2. The sample PACF of the differenced series $\{Y_t\}$ in Example 5.1.1.


Yule-Walker Estimation with q > 0; Moment Estimators

The Yule-Walker estimates for the parameters in an AR(p) model are examples
of moment estimators: The autocovariances at lags 0, 1, ..., p are replaced by the
corresponding sample estimates in the Yule-Walker equations (5.1.3), which are
then solved for the parameters $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_p)'$ and $\sigma^2$. The analogous procedure
for ARMA(p, q) models with q > 0 is easily formulated, but the corresponding
equations are nonlinear in the unknown coefficients, leading to possible nonexistence
and nonuniqueness of solutions for the required estimators.

From (3.2.5), the equations to be solved for $\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ and $\sigma^2$ are

$$\hat\gamma(k) - \phi_1\hat\gamma(k-1) - \cdots - \phi_p\hat\gamma(k-p) = \sigma^2\sum_{j=k}^{q}\theta_j\psi_{j-k}, \quad 0 \le k \le p+q, \tag{5.1.16}$$

where $\psi_j$ must first be expressed in terms of $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ using the identity $\psi(z) =
\theta(z)/\phi(z)$ ($\theta_0 := 1$ and $\theta_j = \psi_j = 0$ for j < 0).


Example 5.1.2

For the MA(1) model the equations (5.1.16) are equivalent to

$$\hat\gamma(0) = \hat\sigma^2\left(1 + \hat\theta_1^2\right), \tag{5.1.17}$$

$$\hat\rho(1) = \frac{\hat\theta_1}{1 + \hat\theta_1^2}. \tag{5.1.18}$$

If $|\hat\rho(1)| > .5$, there is no real solution, so we define $\hat\theta_1 = \hat\rho(1)/|\hat\rho(1)|$. If $|\hat\rho(1)| \le .5$,
then the solution of (5.1.17)-(5.1.18) (with $|\hat\theta_1| \le 1$) is

$$\hat\theta_1 = \left(1 - \left(1 - 4\hat\rho^2(1)\right)^{1/2}\right)\big/\left(2\hat\rho(1)\right),$$

$$\hat\sigma^2 = \hat\gamma(0)\big/\left(1 + \hat\theta_1^2\right).$$

For the overshort data of Example 3.2.8, $\hat\rho(1) = -0.5035$ and $\hat\gamma(0) = 3416$, so the
fitted MA(1) model has parameters $\hat\theta_1 = -1.0$ and $\hat\sigma^2 = 1708$.
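
The MA(1) moment estimator of this example is simple enough to write out in full. The sketch below follows (5.1.17) and (5.1.18); the function name `ma1_moment_estimate` is an assumption, and the separate branch for $\hat\rho(1) = 0$ is an added guard against division by zero (the limiting value of the formula is 0) rather than part of the text.

```python
import math

def ma1_moment_estimate(gamma0_hat, rho1_hat):
    """Moment estimates (theta1_hat, sigma2_hat) for an MA(1) model."""
    if abs(rho1_hat) > 0.5:
        # No real solution of (5.1.18): take theta1_hat = rho1_hat / |rho1_hat|.
        theta = math.copysign(1.0, rho1_hat)
    elif rho1_hat == 0.0:
        theta = 0.0  # added guard: limit of the formula as rho1_hat -> 0
    else:
        # Root of (5.1.18) with |theta| <= 1.
        theta = (1.0 - math.sqrt(1.0 - 4.0 * rho1_hat ** 2)) / (2.0 * rho1_hat)
    sigma2 = gamma0_hat / (1.0 + theta ** 2)
    return theta, sigma2

# Overshort data: rho_hat(1) = -0.5035, gamma_hat(0) = 3416.
print(ma1_moment_estimate(3416, -0.5035))
```

For the overshort data the first branch applies, giving $\hat\theta_1 = -1$ and $\hat\sigma^2 = 3416/2 = 1708$.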


Relative Efficiency of Estimators 

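The performance of two competing estimators is often measured by computing their
asymptotic relative efficiency. In a general statistics estimation problem, suppose $\hat\theta_n^{(1)}$
and $\hat\theta_n^{(2)}$ are two estimates of the parameter $\theta$ in the parameter space $\Theta$ based on the
observations $X_1, \ldots, X_n$. If $\hat\theta_n^{(i)}$ is approximately $N(\theta, \sigma_i^2(\theta))$ for large n, i = 1, 2,
then the asymptotic efficiency of $\hat\theta_n^{(1)}$ relative to $\hat\theta_n^{(2)}$ is defined to be

$$e\left(\theta, \hat\theta^{(1)}, \hat\theta^{(2)}\right) = \frac{\sigma_2^2(\theta)}{\sigma_1^2(\theta)}.$$

If $e(\theta, \hat\theta^{(1)}, \hat\theta^{(2)}) \le 1$ for all $\theta \in \Theta$, then we say that $\hat\theta_n^{(2)}$ is a more efficient
estimator of $\theta$ than $\hat\theta_n^{(1)}$ (strictly more efficient if in addition, $e(\theta, \hat\theta^{(1)}, \hat\theta^{(2)}) < 1$ for
some $\theta \in \Theta$). For the MA(1) process the moment estimator $\hat\theta_n^{(1)}$ discussed in Example
5.1.2 is approximately $N(\theta_1, \sigma_1^2(\theta_1)/n)$ with

$$\sigma_1^2(\theta_1) = \left(1 + \theta_1^2 + 4\theta_1^4 + \theta_1^6 + \theta_1^8\right)\big/\left(1 - \theta_1^2\right)^2$$

(see TSTM, p. 254). On the other hand, the innovations estimator $\hat\theta_n^{(2)}$ discussed in
the next section is distributed approximately as $N(\theta_1, n^{-1})$. Thus, $e(\theta_1, \hat\theta^{(1)}, \hat\theta^{(2)}) =
\sigma_1^{-2}(\theta_1) \le 1$ for all $|\theta_1| < 1$, with strict inequality when $\theta_1 \neq 0$. In particular,

$$e\left(\theta_1, \hat\theta^{(1)}, \hat\theta^{(2)}\right) = \begin{cases} .82, & \theta_1 = .25, \\ .37, & \theta_1 = .50, \\ .06, & \theta_1 = .75, \end{cases}$$

demonstrating the superiority, at least in terms of asymptotic relative efficiency, of
$\hat\theta_n^{(2)}$ over $\hat\theta_n^{(1)}$. On the other hand (Section 5.2), the maximum likelihood estimator $\hat\theta_n^{(3)}$
of $\theta_1$ is approximately $N(\theta_1, (1 - \theta_1^2)/n)$. Hence,

$$e\left(\theta_1, \hat\theta^{(2)}, \hat\theta^{(3)}\right) = \begin{cases} .94, & \theta_1 = .25, \\ .75, & \theta_1 = .50, \\ .44, & \theta_1 = .75. \end{cases}$$

While $\hat\theta_n^{(3)}$ is more efficient, $\hat\theta_n^{(2)}$ has reasonably good efficiency, except when $|\theta_1|$ is
close to 1, and can serve as initial value for the nonlinear optimization procedure in
computing the maximum likelihood estimator.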
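
The tabulated efficiencies follow directly from the three asymptotic variances quoted above, as the short check below illustrates. The function names are assumptions introduced for the illustration.

```python
def moment_variance(theta1):
    """sigma_1^2(theta_1) for the MA(1) moment estimator."""
    t2 = theta1 ** 2
    return (1 + t2 + 4 * t2 ** 2 + t2 ** 3 + t2 ** 4) / (1 - t2) ** 2

def eff_moment_vs_innovations(theta1):
    """e(theta_1, moment, innovations) = 1 / sigma_1^2(theta_1)."""
    return 1.0 / moment_variance(theta1)

def eff_innovations_vs_mle(theta1):
    """e(theta_1, innovations, MLE) = (1 - theta_1^2) / 1."""
    return 1.0 - theta1 ** 2

for th in (0.25, 0.50, 0.75):
    print(th, round(eff_moment_vs_innovations(th), 2), round(eff_innovations_vs_mle(th), 2))
```

Evaluating `moment_variance(0.0)` gives 1, which is why the inequality is strict only away from $\theta_1 = 0$.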

While the method of moments is an effective procedure for fitting autoregressive
models, it does not perform as well for ARMA models with q > 0. From a
computational point of view, it requires as much computing time as the more effi- 
cient estimators based on either the innovations algorithm or the Hannan—Rissanen 
procedure and is therefore rarely used except when q = 0. 


5.1.2 Burg’s Algorithm 


The Yule-Walker coefficients $\hat\phi_{p1}, \ldots, \hat\phi_{pp}$ are precisely the coefficients of the best
linear predictor of $X_{p+1}$ in terms of $\{X_p, \ldots, X_1\}$ under the assumption that the ACF
of $\{X_t\}$ coincides with the sample ACF at lags 1, ..., p.

Burg’s algorithm estimates the PACF $\{\phi_{11}, \phi_{22}, \ldots\}$ by successively minimizing
sums of squares of forward and backward one-step prediction errors with respect to the
coefficients $\phi_{ii}$. Given observations $\{x_1, \ldots, x_n\}$ of a stationary zero-mean time series
$\{X_t\}$ we define $u_i(t)$, $t = i+1, \ldots, n$, $0 \le i < n$, to be the difference between $x_{n+1+i-t}$
and the best linear estimate of $x_{n+1+i-t}$ in terms of the preceding i observations.
Similarly, we define $v_i(t)$, $t = i+1, \ldots, n$, $0 \le i < n$, to be the difference between
$x_{n+1-t}$ and the best linear estimate of $x_{n+1-t}$ in terms of the subsequent i observations.
Then it can be shown (see Problem 5.6) that the forward and backward prediction
errors $\{u_i(t)\}$ and $\{v_i(t)\}$ satisfy the recursions

$$u_0(t) = v_0(t) = x_{n+1-t},$$

$$u_i(t) = u_{i-1}(t-1) - \phi_{ii}v_{i-1}(t), \tag{5.1.19}$$

and

$$v_i(t) = v_{i-1}(t) - \phi_{ii}u_{i-1}(t-1). \tag{5.1.20}$$

Burg’s estimate $\phi_{11}^{(B)}$ of $\phi_{11}$ is found by minimizing

$$\sigma_1^2 := \frac{1}{2(n-1)}\sum_{t=2}^{n}\left[u_1^2(t) + v_1^2(t)\right]$$

with respect to $\phi_{11}$. This gives corresponding numerical values for $u_1(t)$ and $v_1(t)$
and $\sigma_1^2$ that can then be substituted into (5.1.19) and (5.1.20) with i = 2. Then we
minimize

$$\sigma_2^2 := \frac{1}{2(n-2)}\sum_{t=3}^{n}\left[u_2^2(t) + v_2^2(t)\right]$$

with respect to $\phi_{22}$ to obtain the Burg estimate $\phi_{22}^{(B)}$ of $\phi_{22}$ and corresponding values
of $u_2(t)$, $v_2(t)$, and $\sigma_2^2$. This process can clearly be continued to obtain estimates $\phi_{pp}^{(B)}$
and corresponding minimum values, $\sigma_p^{(B)2}$, $p \le n-1$. Estimates of the coefficients
$\phi_{pj}$, $1 \le j \le p-1$, in the best linear predictor

$$P_pX_{p+1} = \phi_{p1}X_p + \cdots + \phi_{pp}X_1$$

are then found by substituting the estimates $\phi_{ii}^{(B)}$, i = 1, ..., p, for $\phi_{ii}$ in the recursions
(2.5.20)-(2.5.22). The resulting estimates of $\phi_{pj}$, j = 1, ..., p, are the coefficient
estimates of the Burg AR(p) model for the data $\{x_1, \ldots, x_n\}$. The Burg estimate of
the white noise variance is the minimum value $\sigma_p^{(B)2}$ found in the determination of
$\phi_{pp}^{(B)}$. The calculation of the estimates of $\phi_{pp}$ and $\sigma_p^2$ described above is equivalent
(Problem 5.7) to solving the following recursions:


Burg’s Algorithm:

$$d(1) = \sum_{t=2}^{n}\left(u_0^2(t-1) + v_0^2(t)\right),$$

$$\phi_{ii}^{(B)} = \frac{2}{d(i)}\sum_{t=i+1}^{n} v_{i-1}(t)\,u_{i-1}(t-1),$$

$$d(i+1) = \left(1 - \phi_{ii}^{(B)2}\right)d(i) - v_i^2(i+1) - u_i^2(n),$$

$$\sigma_i^{(B)2} = \left[\left(1 - \phi_{ii}^{(B)2}\right)d(i)\right]\big/\left[2(n-i)\right].$$


The large-sample distribution of the estimated coefficients for the Burg estimators
of the coefficients of an AR(p) process is the same as for the Yule-Walker estimators,
namely, $N(\boldsymbol{\phi}, n^{-1}\sigma^2\Gamma_p^{-1})$. Approximate large-sample confidence intervals for the
coefficients can be found as in Section 5.1.1 by substituting estimated values for $\sigma^2$
and $\Gamma_p$.


Example 5.1.3 The Dow Jones Utilities Index


The fitting of AR models using Burg’s algorithm in the program ITSM is completely 
analogous to the use of the Yule—Walker equations. Applying the same transfor- 
mations as in Example 5.1.1 to the Dow Jones Utilities Index and selecting Burg 
instead of Yule-Walker in the Preliminary Estimation dialog box, we obtain 
the minimum AICC Burg model 


$$X_t - 0.4371X_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1423), \tag{5.1.21}$$


with AICC = 74.492. This is slightly different from the Yule-Walker AR(1) model
fitted in Example 5.1.1, and it has a larger likelihood L, i.e., a smaller value of
-2 ln L (see Section 5.2). Although the two methods give estimators with the same
large-sample distributions, for finite sample sizes the Burg model usually has smaller
estimated white noise variance and larger Gaussian likelihood. From the ratio of the
estimated coefficient to (1.96 × standard error) displayed by ITSM, we obtain the
95% confidence bounds for $\phi$: 0.4371 ± 0.4371/2.1668 = (0.2354, 0.6388).
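
The recursions of Burg's algorithm in Section 5.1.2 can be sketched as follows. This is an illustrative implementation, not the ITSM code: the function name `burg` is an assumption, $d(i)$ is computed from its defining sum $\sum_{t=i+1}^{n}[u_{i-1}^2(t-1) + v_{i-1}^2(t)]$ rather than through the recursion for $d(i+1)$, and the coefficient update is the usual Durbin-Levinson step $\phi_{ij} = \phi_{i-1,j} - \phi_{ii}\phi_{i-1,i-j}$, which is assumed here to be what the recursions (2.5.20)-(2.5.22) supply. The data are assumed to be already mean-corrected.

```python
import numpy as np

def burg(x, p):
    """Burg AR(p) fit to mean-corrected data x_1..x_n; returns (phi, sigma2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 1 <= p <= n - 1:
        raise ValueError("need 1 <= p <= n - 1")
    # u_0(t) = v_0(t) = x_{n+1-t}, t = 1..n (the data in reverse order).
    u = x[::-1].copy()
    v = x[::-1].copy()
    phi = np.zeros(0)
    sigma2 = None
    for i in range(1, p + 1):
        a = u[:-1]   # u_{i-1}(t-1), t = i+1..n
        b = v[1:]    # v_{i-1}(t),   t = i+1..n
        d = np.sum(a ** 2 + b ** 2)
        phi_ii = 2.0 * np.sum(a * b) / d
        # (5.1.19) and (5.1.20)
        u, v = a - phi_ii * b, b - phi_ii * a
        sigma2 = (1.0 - phi_ii ** 2) * d / (2.0 * (n - i))
        # Durbin-Levinson-type update of the lower-order coefficients.
        phi = np.concatenate([phi - phi_ii * phi[::-1], [phi_ii]])
    return phi, sigma2

# Hypothetical usage on a simulated, mean-corrected AR(1) series.
rng = np.random.default_rng(1)
z = rng.standard_normal(300)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.5 * x[t - 1] + z[t]
print(burg(x - x.mean(), 1))
```

Each estimate $\phi_{ii}^{(B)}$ has the form $2\sum ab/\sum(a^2 + b^2)$, so it cannot exceed 1 in absolute value.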


Example 5.1.4 The lake data 


This series $\{Y_t, t = 1, \ldots, 98\}$ has already been studied in Example 1.3.5. In this
example we shall consider the problem of fitting an AR process directly to the data
without first removing any trend component. A graph of the data was displayed in
Figure 1.9. The sample ACF and PACF are shown in Figures 5.3 and 5.4, respectively.

The sample PACF shown in Figure 5.4 strongly suggests fitting an AR(2) model
to the mean-corrected data $X_t = Y_t - 9.0041$. After clicking on the blue preliminary
estimation button of ITSM select Yes to subtract the sample mean from $\{Y_t\}$. Then
specify 2 for the AR order, 0 for the MA order, and Burg for estimation. Click OK
to obtain the model


$$X_t - 1.0449X_{t-1} + 0.2456X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4706),$$

with AICC value 213.55 and 95% confidence bounds

$\phi_1$ : 1.0449 ± 1.0449/5.5295 = (0.8559, 1.2339),

$\phi_2$ : -0.2456 ± 0.2456/1.2997 = (-0.4346, -0.0566).

Selecting the Yule-Walker method for estimation, we obtain the model

$$X_t - 1.0538X_{t-1} + 0.2668X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4920),$$

with AICC value 213.57 and 95% confidence bounds

$\phi_1$ : 1.0538 ± 1.0538/5.5227 = (0.8630, 1.2446),

$\phi_2$ : -0.2668 ± 0.2668/1.3980 = (-0.4576, -0.0760).

Figure 5-3. The sample ACF of the lake data in Example 5.1.4.

Figure 5-4. The sample PACF of the lake data in Example 5.1.4.


We notice, as in Example 5.1.3, that the Burg model again has smaller white noise 
variance and larger Gaussian likelihood than the Yule-Walker model. 

If we determine the minimum AICC Yule—Walker and Burg models, we find that 
they are both of order 2. Thus the order suggested by the sample PACF coincides 
again with the order obtained by AICC minimization. 


5.1.3. The Innovations Algorithm 


Just as we can fit autoregressive models of orders 1, 2, ... to the data $\{x_1, \ldots, x_n\}$ by
applying the Durbin-Levinson algorithm to the sample autocovariances, we can also
fit moving average models

$$X_t = Z_t + \hat\theta_{m1}Z_{t-1} + \cdots + \hat\theta_{mm}Z_{t-m}, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat v_m) \tag{5.1.22}$$

of orders m = 1, 2, ... by means of the innovations algorithm (Section 2.5.2). The
estimated coefficient vectors $\hat{\boldsymbol{\theta}}_m := (\hat\theta_{m1}, \ldots, \hat\theta_{mm})'$ and white noise variances $\hat v_m$,
m = 1, 2, ..., are specified in the following definition. (The justification for using
estimators defined in this way is contained in Remark 1 following the definition.)


Definition 5.1.2

The fitted innovations MA(m) model is

$$X_t = Z_t + \hat\theta_{m1}Z_{t-1} + \cdots + \hat\theta_{mm}Z_{t-m}, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat v_m),$$

where $\hat{\boldsymbol{\theta}}_m$ and $\hat v_m$ are obtained from the innovations algorithm with the ACVF
replaced by the sample ACVF.
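
A minimal sketch of the computation behind Definition 5.1.2 is given below. It uses the standard form of the innovations recursions for a stationary series, $v_0 = \gamma(0)$, $\theta_{n,n-k} = v_k^{-1}\bigl(\gamma(n-k) - \sum_{j=0}^{k-1}\theta_{k,k-j}\theta_{n,n-j}v_j\bigr)$ for $0 \le k < n$, and $v_n = \gamma(0) - \sum_{j=0}^{n-1}\theta_{n,n-j}^2 v_j$, which is assumed to be the version given in Section 2.5.2. The function name `innovations` is illustrative. To fit a model to data, the argument `gamma` would hold sample autocovariances; the usage lines instead feed in the exact ACVF of an MA(1) process with $\theta_1 = 0.5$ and $\sigma^2 = 1$ so that the behavior as m grows is visible.

```python
def innovations(gamma, m):
    """Innovations recursions up to order m.

    gamma[h] is the autocovariance at lag h, h = 0..m.
    Returns (theta, v) with theta[n][j] = theta_{n,j} (j = 1..n) and v[n] = v_n.
    """
    v = [gamma[0]]
    theta = [[0.0]]          # placeholder for n = 0 (never read)
    for n in range(1, m + 1):
        th = [0.0] * (n + 1)  # th[j] = theta_{n,j}; th[0] is unused
        for k in range(n):
            s = sum(theta[k][k - j] * th[n - j] * v[j] for j in range(k))
            th[n - k] = (gamma[n - k] - s) / v[k]
        v.append(gamma[0] - sum(th[n - j] ** 2 * v[j] for j in range(n)))
        theta.append(th)
    return theta, v

# Exact ACVF of X_t = Z_t + 0.5 Z_{t-1}, {Z_t} ~ WN(0, 1): gamma(0) = 1.25, gamma(1) = 0.5.
gamma = [1.25, 0.5] + [0.0] * 19
theta, v = innovations(gamma, 20)
for m in (1, 2, 5, 20):
    print(m, theta[m][1], v[m])
```

With the exact ACVF, $\theta_{m1}$ approaches 0.5 and $v_m$ approaches 1 as m increases, while the order-1 values (0.4 and 1.05) are noticeably off.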


Remark 1. It can be shown (see Brockwell and Davis, 1988) that if $\{X_t\}$ is an
invertible MA(q) process

$$X_t = Z_t + \theta_1Z_{t-1} + \cdots + \theta_qZ_{t-q}, \quad \{Z_t\} \sim \mathrm{IID}(0, \sigma^2),$$

with $EZ_t^4 < \infty$, and if we define $\theta_0 = 1$ and $\theta_j = 0$ for j > q, then the innovation
estimates have the following large-sample properties. If $n \to \infty$ and m(n) is any
sequence of positive integers such that $m(n) \to \infty$ but $n^{-1/3}m(n) \to 0$, then for each
positive integer k the joint distribution function of

$$n^{1/2}\left(\hat\theta_{m1} - \theta_1, \hat\theta_{m2} - \theta_2, \ldots, \hat\theta_{mk} - \theta_k\right)'$$

converges to that of the multivariate normal distribution with mean 0 and covariance
matrix $A = [a_{ij}]_{i,j=1}^{k}$, where

$$a_{ij} = \sum_{r=1}^{\min(i,j)}\theta_{i-r}\theta_{j-r}. \tag{5.1.23}$$

This result enables us to find approximate large-sample confidence intervals for the
moving-average coefficients from the innovation estimates as described in the examples
below. Moreover, the estimator $\hat v_m$ is consistent for $\sigma^2$ in the sense that for every
$\epsilon > 0$, $P(|\hat v_m - \sigma^2| > \epsilon) \to 0$ as $m \to \infty$.
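
The covariance matrix in (5.1.23) is easy to tabulate for given coefficients. The following sketch is illustrative only; the function name `innovations_asymptotic_cov` is an assumption. It takes $(\theta_1, \ldots, \theta_q)$, pads with $\theta_0 = 1$ and $\theta_j = 0$ for j > q as in the remark, and returns the k-by-k matrix A.

```python
import numpy as np

def innovations_asymptotic_cov(theta, k):
    """Matrix A = [a_ij], a_ij = sum_{r=1}^{min(i,j)} theta_{i-r} theta_{j-r}, i, j = 1..k."""
    th = np.zeros(k)            # th[j] = theta_j for j = 0..k-1
    th[0] = 1.0                 # theta_0 = 1
    q = min(len(theta), k - 1)  # theta_j = 0 for j > q
    th[1:q + 1] = theta[:q]
    A = np.empty((k, k))
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            A[i - 1, j - 1] = sum(th[i - r] * th[j - r] for r in range(1, min(i, j) + 1))
    return A

# MA(1) with theta_1 = 0.5: asymptotic covariance of the first two innovation estimates.
A = innovations_asymptotic_cov([0.5], 2)
n = 100  # hypothetical sample size
print(A)
print("approx. standard deviations:", np.sqrt(np.diag(A) / n))
```

The diagonal entries are $a_{jj} = \sum_{i=0}^{j-1}\theta_i^2$, so the approximate standard deviation of the jth estimate is $n^{-1/2}\bigl(\sum_{i=0}^{j-1}\theta_i^2\bigr)^{1/2}$.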


Remark 2. Although the recursive fitting of moving-average models using the
innovations algorithm is closely analogous to the recursive fitting of autoregressive
models using the Durbin-Levinson algorithm, there is one important distinction. For
an AR(p) process the Yule-Walker and Burg estimators $\hat{\boldsymbol{\phi}}_p$ are consistent estimators
of $(\phi_1, \ldots, \phi_p)'$ as the sample size $n \to \infty$. However, for an MA(q) process the
estimator $\hat{\boldsymbol{\theta}}_q = (\hat\theta_{q1}, \ldots, \hat\theta_{qq})'$ is not consistent for $(\theta_1, \ldots, \theta_q)'$. For consistency it is
necessary to use the estimators $(\hat\theta_{m1}, \ldots, \hat\theta_{mq})'$ with m(n) satisfying the conditions of
Remark 1. The choice of m for any fixed sample size can be made by increasing m
until the vector $(\hat\theta_{m1}, \ldots, \hat\theta_{mq})'$ stabilizes. It is found in practice that there is a large
range of values of m for which the fluctuations in $\hat\theta_{mj}$ are small compared with the
estimated asymptotic standard deviation $n^{-1/2}\bigl(\sum_{i=0}^{j-1}\hat\theta_{mi}^2\bigr)^{1/2}$ as found from (5.1.23)
when the coefficients $\theta_j$ are replaced by their estimated values $\hat\theta_{mj}$.
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Order Selection 

Three useful techniques for selecting an appropriate MA model are given below. The 
third is more systematic and extends beyond the narrow class of pure moving-average 
models. 


We know from Section 3.2.2 that for an MA($q$) process the autocorrelations $\rho(m)$,
$m > q$, are zero. Moreover, we know from Bartlett's formula (Section 2.4) that
the sample autocorrelation $\hat\rho(m)$, $m > q$, is approximately normally distributed
with mean $\rho(m) = 0$ and variance $n^{-1}[1 + 2\rho^2(1) + \cdots + 2\rho^2(q)]$. This result
enables us to use the graph of $\hat\rho(m)$, $m = 1, 2, \ldots$, both to decide whether or
not a given data set can be plausibly modeled by a moving-average process and
also to obtain a preliminary estimate of the order $q$ as the smallest value of $m$
such that $\hat\rho(k)$ is not significantly different from zero for all $k > m$. For practical
purposes "significantly different from zero" is often interpreted as "larger than
$1.96/\sqrt{n}$ in absolute value" (cf. the corresponding approach to order selection
for AR models based on the sample PACF and described in Section 5.1.1).


If in addition to examining $\hat\rho(m)$, $m = 1, 2, \ldots$, we examine the coefficient vectors
$\hat{\boldsymbol\theta}_m$, $m = 1, 2, \ldots$, we are able not only to assess the appropriateness of a moving-average
model and estimate its order $q$, but at the same time to obtain preliminary
estimates $\hat\theta_{m1}, \ldots, \hat\theta_{mq}$ of the coefficients. By inspecting the estimated coefficients
$\hat\theta_{m1}, \ldots, \hat\theta_{mm}$ for $m = 1, 2, \ldots$ and the ratio of each coefficient estimate $\hat\theta_{mj}$ to
1.96 times its approximate standard deviation $\sigma_j = n^{-1/2}\bigl[\sum_{i=0}^{j-1} \hat\theta_{mi}^2\bigr]^{1/2}$, we can
see which of the coefficient estimates are most significantly different from zero,
estimate the order of the model to be fitted as the largest lag $j$ for which the ratio
is larger than 1 in absolute value, and at the same time read off estimated values
for each of the coefficients. A default value of $m$ is set by the program, but it may
be altered manually. As $m$ is increased the values $\hat\theta_{m1}, \ldots, \hat\theta_{mm}$ stabilize in the
sense that the fluctuations in each component are of order $n^{-1/2}$, the asymptotic
standard deviation of $\hat\theta_{m1}$.


As for autoregressive models, a more systematic approach to order selection for
moving-average models is to find the values of $q$ and $\hat{\boldsymbol\theta}_q = (\hat\theta_{m1}, \ldots, \hat\theta_{mq})'$ that
minimize the AICC statistic

$$\mathrm{AICC} = -2 \ln L(\boldsymbol\theta_q, S(\boldsymbol\theta_q)/n) + 2(q + 1)n/(n - q - 2),$$

where $L$ is the Gaussian likelihood defined in (5.2.9) and $S$ is defined in (5.2.11).
(See Section 5.5 for further details.)
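
The first of these techniques can be sketched in a few lines of Python. The helper names `sample_acf` and `preliminary_ma_order` are hypothetical, and the rule coded here is only the rough $1.96/\sqrt{n}$ screen described above: it reports the largest lag whose sample autocorrelation falls outside the bound, which is the same as the smallest $m$ beyond which no lag is significant. Because about 5% of lags can exceed the bound by chance, the result should be read as a starting point rather than a decision.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat(0..max_lag), using divisor n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    acvf = np.array([np.dot(d[: n - h], d[h:]) / n for h in range(max_lag + 1)])
    return acvf / acvf[0]

def preliminary_ma_order(rho, n):
    """Largest lag k >= 1 with |rho[k]| > 1.96/sqrt(n), or 0 if there is none."""
    bound = 1.96 / np.sqrt(n)
    significant = [k for k in range(1, len(rho)) if abs(rho[k]) > bound]
    return max(significant) if significant else 0

# Illustrative autocorrelations (made-up numbers, not from any data set in the text).
rho_demo = np.array([1.0, 0.5, 0.3, 0.05, -0.02])
order_demo = preliminary_ma_order(rho_demo, n=100)   # bound is 0.196, so the order is 2
```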


Confidence Regions for the Coefficients
Asymptotic confidence regions for the coefficient vector $\boldsymbol\theta_q$ and for its individual
components can be found with the aid of the large-sample distribution specified in
Remark 1. For example, approximate 95% confidence bounds for $\theta_j$ are given by

$$\hat\theta_{mj} \pm 1.96\, n^{-1/2} \Bigl(\sum_{i=0}^{j-1} \hat\theta_{mi}^2\Bigr)^{1/2}. \tag{5.1.24}$$

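
The bounds (5.1.24) are simple to compute once the innovation estimates are available. The sketch below uses a hypothetical helper `innovation_bounds`; the inputs 0.4269 and 0.2704 are the two MA(2) innovation estimates reported for the differenced Dow Jones series in this section, and $n = 77$ is assumed from the sample size used for that series later in the chapter. The returned ratios are each estimate divided by 1.96 times its approximate standard deviation, so they should be close to, though not necessarily identical with, the ratios printed by ITSM.

```python
import numpy as np

def innovation_bounds(theta_hat, n):
    """Approximate 95% bounds (5.1.24) for theta_1..theta_q.

    theta_hat[j-1] is the innovation estimate theta_hat_{m,j}.
    Returns (lower, upper, ratio), with ratio = estimate / (1.96 * std. dev.).
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    # theta_hat_{m,0} = 1 by definition, so the running sum of squares starts at 1.
    squares = np.concatenate(([1.0], theta_hat[:-1] ** 2))
    half_width = 1.96 * np.sqrt(np.cumsum(squares) / n)
    return theta_hat - half_width, theta_hat + half_width, theta_hat / half_width

lower, upper, ratio = innovation_bounds([0.4269, 0.2704], n=77)
# ratio is roughly (1.911, 1.113) for these inputs.
```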
Example 5.1.5 The Dow Jones Utilities Index


In Example 5.1.1 we fitted an AR(1) model to the differenced Dow Jones Utilities
Index. The sample ACF of the differenced data shown in Figure 5.1 suggests that
an MA(2) model might also provide a good fit to the data. To apply the innovation
technique for preliminary estimation, we proceed as in Example 5.1.1 to difference
the series DOWJ.TSM to obtain observations of the differenced series $\{Y_t\}$. We then
select preliminary estimation by clicking on the blue PRE button and subtract the
mean of the differences to obtain observations of the differenced and mean-corrected
series $\{X_t\}$. In the Preliminary Estimation dialog box enter 0 for the AR order
and 2 for the MA order, and select Innovations as the estimation method. We must
then specify a value of $m$, which is set by default in this case to 17. If we accept the
default value, the program will compute $\hat\theta_{17,1}, \ldots, \hat\theta_{17,17}$ and print out the first two
values as the estimates of $\theta_1$ and $\theta_2$, together with the ratios of the estimated values
to 1.96 times their estimated standard deviations. These are


MA COEFFICIENT
.4269 .2704
COEFFICIENT/(1.96*STANDARD ERROR)
1.9114 1.1133


The remaining parameter in the model is the white noise variance, for which two
estimates are given:


WN VARIANCE ESTIMATE = (RESID SS)/N
.1470
INNOVATION WN VARIANCE ESTIMATE
.1122


The first of these is the average of the squares of the rescaled one-step prediction
errors under the fitted MA(2) model, i.e., $\frac{1}{77}\sum_{j=1}^{77}(X_j - \hat X_j)^2/r_{j-1}$. The second value
is the innovation estimate, $\hat v_{17}$. (By default ITSM retains the first value. If you wish
instead to use the innovation estimate, you must change the white noise variance by
selecting Model>Specify and setting the white noise value to the desired value.) The
fitted model for $X_t$ ($= Y_t - .1336$) is thus

$$X_t = Z_t + 0.4269 Z_{t-1} + 0.2704 Z_{t-2}, \qquad \{Z_t\} \sim \mathrm{WN}(0, 0.1470),$$

with AICC = 77.467.

To see all 17 estimated coefficients $\hat\theta_{17,j}$, $j = 1, \ldots, 17$, we repeat the preliminary
estimation, this time fitting an MA(17) model with $m = 17$. The coefficients and ratios
for the resulting model are found to be as follows:


MA COEFFICIENT
 .4269  .2704  .1183  .1589  .1355  .1568  .1284 -.0060
 .0148 -.0017  .1974 -.0463  .2023  .1285 -.0213 -.2575
 .0760
COEFFICIENT/(1.96*STANDARD ERROR)
1.9114 1.1133  .4727  .6314  .5331  .6127  .4969 -.0231
 .0568 -.0064  .7594 -.1757  .7667  .4801 -.0792 -.9563
 .2760


The ratios indicate that the estimated coefficients most significantly different
from zero are the first and second, reinforcing our original intention of fitting an
MA(2) model to the data. Estimated coefficients $\hat\theta_{mj}$ for other values of $m$ can be
examined in the same way, and it is found that the values obtained for $m > 17$ change
only slightly from the values tabulated above.

By fitting MA($q$) models of orders 0, 1, 2, ..., 26 using the innovations algorithm
with the default settings for $m$, we find that the minimum AICC model is the one with
$q = 2$ found above. Thus the model suggested by the sample ACF again coincides
with the more systematically chosen minimum AICC model.


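Innovations Algorithm Estimates when p > 0 and q > 0
The causality assumption (Section 3.1) ensures that

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j},$$

where the coefficients $\psi_j$ satisfy

$$\psi_j = \theta_j + \sum_{i=1}^{\min(j,p)} \phi_i \psi_{j-i}, \qquad j = 0, 1, \ldots, \tag{5.1.25}$$

and we define $\theta_0 := 1$ and $\theta_j := 0$ for $j > q$. To estimate $\psi_1, \ldots, \psi_{p+q}$ we can use
the innovation estimates $\hat\theta_{m1}, \ldots, \hat\theta_{m,p+q}$, whose large-sample behavior is specified in
Remark 1. Replacing $\psi_j$ by $\hat\theta_{mj}$ in (5.1.25) and solving the resulting equations

$$\hat\theta_{mj} = \theta_j + \sum_{i=1}^{\min(j,p)} \phi_i \hat\theta_{m,j-i}, \qquad j = 1, \ldots, p + q, \tag{5.1.26}$$

for $\boldsymbol\phi$ and $\boldsymbol\theta$, we obtain initial parameter estimates $\hat{\boldsymbol\phi}$ and $\hat{\boldsymbol\theta}$. To solve (5.1.26) we first
find $\boldsymbol\phi$ from the last $p$ equations (those with $j = q+1, \ldots, q+p$, in which $\theta_j = 0$):

$$\begin{bmatrix} \hat\theta_{m,q+1} \\ \hat\theta_{m,q+2} \\ \vdots \\ \hat\theta_{m,q+p} \end{bmatrix}
= \begin{bmatrix}
\hat\theta_{m,q} & \hat\theta_{m,q-1} & \cdots & \hat\theta_{m,q+1-p} \\
\hat\theta_{m,q+1} & \hat\theta_{m,q} & \cdots & \hat\theta_{m,q+2-p} \\
\vdots & \vdots & & \vdots \\
\hat\theta_{m,q+p-1} & \hat\theta_{m,q+p-2} & \cdots & \hat\theta_{m,q}
\end{bmatrix}
\begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{bmatrix}. \tag{5.1.27}$$

Having solved (5.1.27) for $\hat{\boldsymbol\phi}$ (which may not be causal), we can easily determine the
estimate of $\boldsymbol\theta$ from the first $q$ equations of (5.1.26):

$$\hat\theta_j = \hat\theta_{mj} - \sum_{i=1}^{\min(j,p)} \hat\phi_i \hat\theta_{m,j-i}, \qquad j = 1, \ldots, q.$$

Finally, the white noise variance $\sigma^2$ is estimated by

$$\hat\sigma^2 = n^{-1} \sum_{t=1}^{n} \bigl(X_t - \hat X_t\bigr)^2 / r_{t-1},$$

where $\hat X_t$ is the one-step predictor of $X_t$ computed from the fitted coefficient vectors
$\hat{\boldsymbol\phi}$ and $\hat{\boldsymbol\theta}$, and $r_{t-1}$ is defined in (3.3.8).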
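
The two-stage solution of (5.1.26) can be sketched as follows. The function name `arma_from_innovations` is hypothetical, and the inputs in the demonstration are not estimates from data: they are the exact $\psi$-weights $\psi_1 = 1.1$, $\psi_2 = 0.77$ of an ARMA(1,1) process with $\phi = 0.7$ and $\theta = 0.4$, chosen so that the recovered coefficients can be checked by hand. Entries $\hat\theta_{m,k}$ with $k < 0$ are treated as zero, which is how the restriction $i \le \min(j, p)$ in (5.1.26) shows up in matrix form.

```python
import numpy as np

def arma_from_innovations(theta_m, p, q):
    """Solve (5.1.26) for (phi, theta) given theta_m[j-1] = theta_hat_{m,j}, j = 1..p+q."""
    psi = np.concatenate(([1.0], np.asarray(theta_m, dtype=float)[: p + q]))  # psi[0] = 1

    def coef(k):
        # theta_hat_{m,k}, with the convention that it is 0 for k < 0
        return psi[k] if k >= 0 else 0.0

    phi = np.zeros(p)
    if p > 0:
        # The p equations with j = q+1..q+p, in which theta_j = 0.
        A = np.array([[coef(j - i) for i in range(1, p + 1)]
                      for j in range(q + 1, q + p + 1)])
        phi = np.linalg.solve(A, psi[q + 1 : q + p + 1])
    # The first q equations then give theta directly.
    theta = np.array([psi[j] - sum(phi[i - 1] * psi[j - i]
                                   for i in range(1, min(j, p) + 1))
                      for j in range(1, q + 1)])
    return phi, theta

phi_demo, theta_demo = arma_from_innovations([1.1, 0.77], p=1, q=1)
```

As the text notes, nothing in this calculation forces the resulting $\hat{\boldsymbol\phi}$ to be causal, so the output should be checked before it is used to initialize a likelihood search.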

The above calculations can all be carried out by selecting the ITSM option Model>
Estimation>Preliminary. This option also computes, if p = q, the ratio of each
estimated coefficient to 1.96 times its estimated standard deviation. Approximate 95% 
confidence intervals can therefore easily be obtained in this case. If the fitted model 
is noncausal, it cannot be used to initialize the search for the maximum likelihood 
estimators, and so the autoregressive coefficients should be set to some causal values 
(e.g., all equal to .001) using the Model>Specify option. If both the innovation and 
Hannan-Rissanen algorithms give noncausal models, it is an indication (but not a 
conclusive one) that the assumed values of p and q may not be appropriate for the 
data. 


Order Selection for Mixed Models 

For models with $p > 0$ and $q > 0$, the sample ACF and PACF are difficult to recognize
and are of far less value in order selection than in the special cases where $p = 0$ or
$q = 0$. A systematic approach, however, is still available through minimization of
the AICC statistic

$$\mathrm{AICC} = -2 \ln L(\boldsymbol\phi_p, \boldsymbol\theta_q, S(\boldsymbol\phi_p, \boldsymbol\theta_q)/n) + 2(p + q + 1)n/(n - p - q - 2),$$

which is discussed in more detail in Section 5.5. For fixed $p$ and $q$ it is clear from the
definition that the AICC value is minimized by the parameter values that maximize
the likelihood. Hence, final decisions regarding the orders $p$ and $q$ that minimize
AICC must be based on maximum likelihood estimation as described in Section 5.2.


Example 5.1.6 The lake data


In Example 5.1.4 we fitted AR(2) models to the mean-corrected lake data using the
Yule-Walker equations and Burg's algorithm. If instead we fit an ARMA(1,1) model
using the innovations method in the option Model>Estimation>Preliminary of
ITSM (with the default value $m = 17$), we obtain the model

$$X_t - 0.7234 X_{t-1} = Z_t + 0.3596 Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, 0.4757),$$

for the mean-corrected series $X_t = Y_t - 9.0041$. The ratios of the two coefficient
estimates $\hat\phi$ and $\hat\theta$ to 1.96 times their estimated standard deviations are given by ITSM
as 3.2064 and 1.8513, respectively. The corresponding 95% confidence intervals are
therefore

$$\phi: \quad 0.7234 \pm 0.7234/3.2064 = (0.4978, 0.9490),$$

$$\theta: \quad 0.3596 \pm 0.3596/1.8513 = (0.1654, 0.5538).$$
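
The arithmetic behind these intervals is worth making explicit: since the printed ratio is the estimate divided by 1.96 times its standard deviation, the half-width of the 95% interval is simply the estimate divided by the ratio. A minimal sketch, with a hypothetical helper name:

```python
def bounds_from_ratio(estimate, ratio):
    """95% bounds from an estimate and its ratio to 1.96 * (standard deviation)."""
    half_width = estimate / ratio   # equals 1.96 * standard deviation
    return estimate - half_width, estimate + half_width

phi_bounds = bounds_from_ratio(0.7234, 3.2064)     # about (0.4978, 0.9490)
theta_bounds = bounds_from_ratio(0.3596, 1.8513)   # about (0.1654, 0.5538)
```

A ratio larger than 1 in absolute value therefore corresponds exactly to an interval that excludes zero.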


It is interesting to note that the value of AICC for this model is 212.89, which is 
smaller than the corresponding values for the Burg and Yule-Walker AR(2) mod- 
els in Example 5.1.4. This suggests that an ARMA(1,1) model may be superior to 
a pure autoregressive model for these data. Preliminary estimation of a variety of 
ARMA(p, q) models shows that the minimum AICC value does in fact occur when 
p = q = 1. (Before committing ourselves to this model, however, we need to com- 
pare AICC values for the corresponding maximum likelihood models. We shall do 
this in Section 5.2.) 


5.1.4 The Hannan-Rissanen Algorithm 


The defining equations for a causal AR($p$) model have the form of a linear regression
model with coefficient vector $\boldsymbol\phi = (\phi_1, \ldots, \phi_p)'$. This suggests the use of simple
least squares regression for obtaining preliminary parameter estimates when $q = 0$.
Application of this technique when $q > 0$ is complicated by the fact that in the general
ARMA($p, q$) equations $X_t$ is regressed not only on $X_{t-1}, \ldots, X_{t-p}$, but also on
the unobserved quantities $Z_{t-1}, \ldots, Z_{t-q}$. Nevertheless, it is still possible to apply
least squares regression to the estimation of $\boldsymbol\phi$ and $\boldsymbol\theta$ by first replacing the unobserved
quantities $Z_{t-1}, \ldots, Z_{t-q}$ in (5.1.1) by estimated values $\hat Z_{t-1}, \ldots, \hat Z_{t-q}$. The parameters
$\boldsymbol\phi$ and $\boldsymbol\theta$ are then estimated by regressing $X_t$ onto $X_{t-1}, \ldots, X_{t-p}, \hat Z_{t-1}, \ldots, \hat Z_{t-q}$.
These are the main steps in the Hannan-Rissanen estimation procedure, which we
now describe in more detail.


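Step 1. A high-order AR($m$) model (with $m > \max(p, q)$) is fitted to the data using the
Yule-Walker estimates of Section 5.1.1. If $(\hat\phi_{m1}, \ldots, \hat\phi_{mm})'$ is the vector of estimated
coefficients, then the estimated residuals are computed from the equations

$$\hat Z_t = X_t - \hat\phi_{m1} X_{t-1} - \cdots - \hat\phi_{mm} X_{t-m}, \qquad t = m+1, \ldots, n.$$

Step 2. Once the estimated residuals $\hat Z_t$, $t = m+1, \ldots, n$, have been computed as
in Step 1, the vector of parameters $\boldsymbol\beta = (\boldsymbol\phi', \boldsymbol\theta')'$ is estimated by least squares linear
regression of $X_t$ onto $(X_{t-1}, \ldots, X_{t-p}, \hat Z_{t-1}, \ldots, \hat Z_{t-q})$, $t = m+1+q, \ldots, n$, i.e.,
by minimizing the sum of squares

$$S(\boldsymbol\beta) = \sum_{t=m+1+q}^{n} \bigl(X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} - \theta_1 \hat Z_{t-1} - \cdots - \theta_q \hat Z_{t-q}\bigr)^2$$

with respect to $\boldsymbol\beta$. This gives the Hannan-Rissanen estimator

$$\hat{\boldsymbol\beta} = (Z'Z)^{-1} Z' \mathbf{X}_n,$$

where $\mathbf{X}_n = (X_{m+1+q}, \ldots, X_n)'$ and $Z$ is the $(n - m - q) \times (p + q)$ matrix

$$Z = \begin{bmatrix}
X_{m+q} & X_{m+q-1} & \cdots & X_{m+q+1-p} & \hat Z_{m+q} & \hat Z_{m+q-1} & \cdots & \hat Z_{m+1} \\
X_{m+q+1} & X_{m+q} & \cdots & X_{m+q+2-p} & \hat Z_{m+q+1} & \hat Z_{m+q} & \cdots & \hat Z_{m+2} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
X_{n-1} & X_{n-2} & \cdots & X_{n-p} & \hat Z_{n-1} & \hat Z_{n-2} & \cdots & \hat Z_{n-q}
\end{bmatrix}.$$

(If $p = 0$, $Z$ contains only the last $q$ columns.) The Hannan-Rissanen estimate of
the white noise variance is

$$\hat\sigma^2_{HR} = \frac{S(\hat{\boldsymbol\beta})}{n - m - q}.$$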
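
Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ITSM routine: the function names are hypothetical, the input series is assumed to be mean-corrected already, the Yule-Walker system is solved by a plain linear solve rather than by the Durbin-Levinson recursion, and the variance estimate divides the residual sum of squares by $n - m - q$. The demonstration data are simulated from an ARMA(1,1) process with $\phi = 0.7$, $\theta = 0.4$ and unit noise variance, using $m = 20$; these choices are purely illustrative.

```python
import numpy as np

def sample_acvf(x, max_lag):
    """Sample autocovariances gamma_hat(0..max_lag) with divisor n."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    n = len(d)
    return np.array([np.dot(d[: n - h], d[h:]) / n for h in range(max_lag + 1)])

def hannan_rissanen(x, p, q, m):
    """Steps 1 and 2 of the Hannan-Rissanen procedure for a mean-corrected series x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Step 1: Yule-Walker AR(m) fit, then estimated residuals Z_hat_t for t > m.
    g = sample_acvf(x, m)
    G = np.array([[g[abs(i - j)] for j in range(m)] for i in range(m)])
    phi_m = np.linalg.solve(G, g[1 : m + 1])
    z = np.zeros(n)
    for t in range(m, n):                       # 0-based t corresponds to t+1 in the text
        z[t] = x[t] - np.dot(phi_m, x[t - m : t][::-1])
    # Step 2: regress X_t on (X_{t-1..t-p}, Z_hat_{t-1..t-q}) for t = m+1+q..n.
    start = m + q
    Z = np.array([np.concatenate((x[t - p : t][::-1], z[t - q : t][::-1]))
                  for t in range(start, n)])
    y = x[start:]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - m - q)
    return beta[:p], beta[p:], sigma2

# Simulated ARMA(1,1) data (illustrative parameters, fixed seed).
rng = np.random.default_rng(0)
n_obs = 5000
noise = rng.standard_normal(n_obs + 1)
x_sim = np.zeros(n_obs)
x_sim[0] = noise[1] + 0.4 * noise[0]
for t in range(1, n_obs):
    x_sim[t] = 0.7 * x_sim[t - 1] + noise[t + 1] + 0.4 * noise[t]

phi_hr, theta_hr, sigma2_hr = hannan_rissanen(x_sim, p=1, q=1, m=20)
```

With a long simulated series the estimates should land near the generating values, which gives a quick sanity check on the regression setup before it is applied to real data.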


Example 5.1.7 The lake data


In Example 5.1.6 an ARMA(1,1) model was fitted to the mean-corrected lake data
using the innovations algorithm. We can fit an ARMA(1,1) model to these data using
the Hannan-Rissanen estimates by selecting Hannan-Rissanen in the Preliminary
Estimation dialog box of ITSM. The fitted model is

$$X_t - 0.6961 X_{t-1} = Z_t + 0.3788 Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, 0.4774),$$

for the mean-corrected series $X_t = Y_t - 9.0041$. (Two estimates of the white noise variance
are computed in ITSM for the Hannan-Rissanen procedure, $\hat\sigma^2_{HR}$ and
$\sum_{t=1}^{n}(X_t - \hat X_t)^2/n$. The latter is the one retained by the program.) The ratios of the two
coefficient estimates to 1.96 times their standard deviation are 4.5289 and 1.3120,
respectively. The corresponding 95% confidence bounds for $\phi$ and $\theta$ are

$$\phi: \quad 0.6961 \pm 0.6961/4.5289 = (0.5424, 0.8498),$$

$$\theta: \quad 0.3788 \pm 0.3788/1.3120 = (0.0901, 0.6675).$$


Clearly, there is little difference between this model and the one fitted using the 
innovations method in Example 5.1.6. (The AICC values are 213.18 for the current 
model and 212.89 for the model fitted in Example 5.1.6.) 


Hannan and Rissanen include a third step in their procedure to improve the 
estimates. 


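Step 3. Using the estimate $\hat{\boldsymbol\beta} = (\hat\phi_1, \ldots, \hat\phi_p, \hat\theta_1, \ldots, \hat\theta_q)'$ from Step 2, set

$$\tilde Z_t = \begin{cases}
0, & \text{if } t \le \max(p, q), \\
X_t - \sum_{j=1}^{p} \hat\phi_j X_{t-j} - \sum_{j=1}^{q} \hat\theta_j \tilde Z_{t-j}, & \text{if } t > \max(p, q).
\end{cases}$$

Now for $t = 1, \ldots, n$ put

$$V_t = \begin{cases}
0, & \text{if } t \le \max(p, q), \\
\sum_{j=1}^{p} \hat\phi_j V_{t-j} + \tilde Z_t, & \text{if } t > \max(p, q),
\end{cases}$$

and

$$W_t = \begin{cases}
0, & \text{if } t \le \max(p, q), \\
-\sum_{j=1}^{q} \hat\theta_j W_{t-j} + \tilde Z_t, & \text{if } t > \max(p, q).
\end{cases}$$

(Observe that both $V_t$ and $W_t$ satisfy the AR recursions $\hat\phi(B) V_t = \tilde Z_t$ and $\hat\theta(B) W_t = \tilde Z_t$
for $t = 1, \ldots, n$.) If $\hat{\boldsymbol\beta}^\dagger$ is the regression estimate of $\boldsymbol\beta$ found by regressing $\tilde Z_t$ on
$(V_{t-1}, \ldots, V_{t-p}, W_{t-1}, \ldots, W_{t-q})$, i.e., if $\hat{\boldsymbol\beta}^\dagger$ minimizes

$$S^\dagger(\boldsymbol\beta) = \sum_{t=\max(p,q)+1}^{n} \Bigl(\tilde Z_t - \sum_{j=1}^{p} \beta_j V_{t-j} - \sum_{k=1}^{q} \beta_{k+p} W_{t-k}\Bigr)^2,$$

then the improved estimate of $\boldsymbol\beta$ is $\tilde{\boldsymbol\beta} = \hat{\boldsymbol\beta}^\dagger + \hat{\boldsymbol\beta}$. The new estimator $\tilde{\boldsymbol\beta}$ then has the
same asymptotic efficiency as the maximum likelihood estimator. In ITSM, however,
we eliminate Step 3, using the model produced by Step 2 as the initial model for the
calculation (by numerical maximization) of the maximum likelihood estimator itself.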


5.2 Maximum Likelihood Estimation 


Suppose that $\{X_t\}$ is a Gaussian time series with mean zero and autocovariance
function $\kappa(i, j) = E(X_i X_j)$. Let $\mathbf{X}_n = (X_1, \ldots, X_n)'$ and let $\hat{\mathbf{X}}_n = (\hat X_1, \ldots, \hat X_n)'$,
where $\hat X_1 = 0$ and $\hat X_j = E(X_j \mid X_1, \ldots, X_{j-1}) = P_{j-1} X_j$, $j \ge 2$. Let $\Gamma_n$ denote the
covariance matrix $\Gamma_n = E(\mathbf{X}_n \mathbf{X}_n')$, and assume that $\Gamma_n$ is nonsingular.

The likelihood of $\mathbf{X}_n$ is

$$L(\Gamma_n) = (2\pi)^{-n/2} (\det \Gamma_n)^{-1/2} \exp\bigl(-\tfrac{1}{2} \mathbf{X}_n' \Gamma_n^{-1} \mathbf{X}_n\bigr). \tag{5.2.1}$$

As we shall now show, the direct calculation of $\det \Gamma_n$ and $\Gamma_n^{-1}$ can be avoided by
expressing this in terms of the one-step prediction errors $X_j - \hat X_j$ and their variances
$v_{j-1}$, $j = 1, \ldots, n$, both of which are easily calculated recursively from the
innovations algorithm (Section 2.5.2).

Let $\theta_{ij}$, $j = 1, \ldots, i$; $i = 1, 2, \ldots$, denote the coefficients obtained when the
innovations algorithm is applied to the autocovariance function $\kappa$ of $\{X_t\}$, and let $C_n$
be the $n \times n$ lower triangular matrix defined in Section 2.5.2. From (2.5.27) we have
the identity

$$\mathbf{X}_n = C_n \bigl(\mathbf{X}_n - \hat{\mathbf{X}}_n\bigr). \tag{5.2.2}$$

We also know from Remark 5 of Section 2.5.2 that the components of $\mathbf{X}_n - \hat{\mathbf{X}}_n$
are uncorrelated. Consequently, by the definition of $v_j$, $\mathbf{X}_n - \hat{\mathbf{X}}_n$ has the diagonal
covariance matrix

$$D_n = \mathrm{diag}\{v_0, \ldots, v_{n-1}\}.$$

From (5.2.2) and (A.2.5) we conclude that

$$\Gamma_n = C_n D_n C_n'. \tag{5.2.3}$$

From (5.2.2) and (5.2.3) we see that

$$\mathbf{X}_n' \Gamma_n^{-1} \mathbf{X}_n = \bigl(\mathbf{X}_n - \hat{\mathbf{X}}_n\bigr)' D_n^{-1} \bigl(\mathbf{X}_n - \hat{\mathbf{X}}_n\bigr) = \sum_{j=1}^{n} \bigl(X_j - \hat X_j\bigr)^2 / v_{j-1} \tag{5.2.4}$$

and

$$\det \Gamma_n = (\det C_n)^2 (\det D_n) = v_0 v_1 \cdots v_{n-1}. \tag{5.2.5}$$

The likelihood (5.2.1) of the vector $\mathbf{X}_n$ therefore reduces to

$$L(\Gamma_n) = \frac{1}{\sqrt{(2\pi)^n v_0 \cdots v_{n-1}}} \exp\Bigl\{-\frac{1}{2} \sum_{j=1}^{n} \bigl(X_j - \hat X_j\bigr)^2 / v_{j-1}\Bigr\}. \tag{5.2.6}$$

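
The equivalence of (5.2.1) and (5.2.6) can be checked numerically. The sketch below, with hypothetical function names, evaluates $-2 \ln L$ in two ways for a small covariance matrix: once through the innovations algorithm, accumulating the prediction errors and their variances, and once directly from the determinant and inverse of $\Gamma_n$. The covariance matrix is that of six observations from an MA(1) process with $\theta = 0.5$ and $\sigma^2 = 1$, and the data vector is arbitrary; both are illustrative choices.

```python
import numpy as np

def neg2_loglik_innovations(x, K):
    """-2 ln L from (5.2.6); K[i, j] = kappa(i+1, j+1) is the covariance matrix of x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = np.zeros((n, n))
    v = np.zeros(n)          # v[j] is the mean squared error of the predictor of x[j]
    xhat = np.zeros(n)       # xhat[0] = 0 by definition
    v[0] = K[0, 0]
    for m in range(1, n):
        for k in range(m):
            s = sum(theta[k, k - j] * theta[m, m - j] * v[j] for j in range(k))
            theta[m, m - k] = (K[m, k] - s) / v[k]
        v[m] = K[m, m] - sum(theta[m, m - j] ** 2 * v[j] for j in range(m))
        xhat[m] = sum(theta[m, j] * (x[m - j] - xhat[m - j]) for j in range(1, m + 1))
    return n * np.log(2 * np.pi) + np.sum(np.log(v)) + np.sum((x - xhat) ** 2 / v)

def neg2_loglik_direct(x, K):
    """-2 ln L from (5.2.1), using the determinant and a linear solve."""
    x = np.asarray(x, dtype=float)
    _, logdet = np.linalg.slogdet(K)
    return len(x) * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(K, x)

# Covariance matrix of an MA(1) with theta = 0.5, sigma^2 = 1 (illustrative).
n_small = 6
K_ma1 = 1.25 * np.eye(n_small) + 0.5 * (np.eye(n_small, k=1) + np.eye(n_small, k=-1))
x_small = np.array([0.3, -1.2, 0.8, 0.1, -0.5, 1.4])
via_innovations = neg2_loglik_innovations(x_small, K_ma1)
via_direct = neg2_loglik_direct(x_small, K_ma1)
```

The two values agree up to rounding error, which is the content of (5.2.4) and (5.2.5). The practical advantage of the innovations form is that for ARMA models the recursions simplify greatly, as described next.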
If $\Gamma_n$ is expressible in terms of a finite number of unknown parameters $\beta_1, \ldots, \beta_r$
(as is the case when $\{X_t\}$ is an ARMA($p, q$) process), the maximum likelihood
estimators of the parameters are those values that maximize $L$ for the given data
set. When $X_1, X_2, \ldots, X_n$ are iid, it is known, under mild assumptions and for $n$
large, that maximum likelihood estimators are approximately normally distributed
with variances that are at least as small as those of other asymptotically normally
distributed estimators (see, e.g., Lehmann, 1983).

Even if $\{X_t\}$ is not Gaussian, it still makes sense to regard (5.2.6) as a measure of
goodness of fit of the model to the data, and to choose the parameters $\beta_1, \ldots, \beta_r$ in such
a way as to maximize (5.2.6). We shall always refer to the estimators $\hat\beta_1, \ldots, \hat\beta_r$ so obtained
as "maximum likelihood" estimators, even when $\{X_t\}$ is not Gaussian. Regardless
of the joint distribution of $X_1, \ldots, X_n$, we shall refer to (5.2.1) and its algebraic
equivalent (5.2.6) as the "likelihood" (or "Gaussian likelihood") of $X_1, \ldots, X_n$. A justification
for using maximum Gaussian likelihood estimators of ARMA coefficients is
that the large-sample distribution of the estimators is the same for $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$,
regardless of whether or not $\{Z_t\}$ is Gaussian (see TSTM, Section 10.8).

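The likelihood for data from an ARMA($p, q$) process is easily computed from
the innovations form of the likelihood (5.2.6) by evaluating the one-step predictors
$\hat X_{i+1}$ and the corresponding mean squared errors $v_i$. These can be found from the
recursions (Section 3.3)

$$\hat X_{n+1} = \begin{cases}
\sum_{j=1}^{n} \theta_{nj} \bigl(X_{n+1-j} - \hat X_{n+1-j}\bigr), & 1 \le n < m, \\
\phi_1 X_n + \cdots + \phi_p X_{n+1-p} + \sum_{j=1}^{q} \theta_{nj} \bigl(X_{n+1-j} - \hat X_{n+1-j}\bigr), & n \ge m,
\end{cases} \tag{5.2.7}$$

and

$$E\bigl(X_{n+1} - \hat X_{n+1}\bigr)^2 = \sigma^2 E\bigl(W_{n+1} - \hat W_{n+1}\bigr)^2 = \sigma^2 r_n, \tag{5.2.8}$$

where $\theta_{nj}$ and $r_n$ are determined by the innovations algorithm with $\kappa$ as in (3.3.3)
and $m = \max(p, q)$. Substituting in the general expression (5.2.6), we obtain the
following:

The Gaussian Likelihood for an ARMA Process:

$$L(\boldsymbol\phi, \boldsymbol\theta, \sigma^2) = \frac{1}{\sqrt{(2\pi\sigma^2)^n r_0 \cdots r_{n-1}}} \exp\Bigl\{-\frac{1}{2\sigma^2} \sum_{j=1}^{n} \frac{\bigl(X_j - \hat X_j\bigr)^2}{r_{j-1}}\Bigr\}. \tag{5.2.9}$$

Differentiating $\ln L(\boldsymbol\phi, \boldsymbol\theta, \sigma^2)$ partially with respect to $\sigma^2$ and noting that $\hat X_j$ and $r_j$
are independent of $\sigma^2$, we find that the maximum likelihood estimators $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$, and $\hat\sigma^2$
satisfy the following equations (Problem 5.8):

Maximum Likelihood Estimators:

$$\hat\sigma^2 = n^{-1} S\bigl(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta}\bigr), \tag{5.2.10}$$

where

$$S\bigl(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta}\bigr) = \sum_{j=1}^{n} \bigl(X_j - \hat X_j\bigr)^2 / r_{j-1}, \tag{5.2.11}$$

and $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$ are the values of $\boldsymbol\phi$, $\boldsymbol\theta$ that minimize

$$\ell(\boldsymbol\phi, \boldsymbol\theta) = \ln\bigl(n^{-1} S(\boldsymbol\phi, \boldsymbol\theta)\bigr) + n^{-1} \sum_{j=1}^{n} \ln r_{j-1}. \tag{5.2.12}$$

Minimization of $\ell(\boldsymbol\phi, \boldsymbol\theta)$ must be done numerically. Initial values for $\boldsymbol\phi$ and $\boldsymbol\theta$ can
be obtained from ITSM using the methods described in Section 5.1. The program then
searches systematically for the values of $\boldsymbol\phi$ and $\boldsymbol\theta$ that minimize the reduced likelihood
(5.2.12) and computes the corresponding maximum likelihood estimate of $\sigma^2$ from
(5.2.10).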
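
For readers who want to see where (5.2.10) and (5.2.12) come from, the calculation can be sketched as follows, using only (5.2.9) and the definition (5.2.11) of $S$. Taking logarithms in (5.2.9) gives

$$\ln L(\boldsymbol\phi, \boldsymbol\theta, \sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2} \sum_{j=1}^{n} \ln r_{j-1} - \frac{S(\boldsymbol\phi, \boldsymbol\theta)}{2\sigma^2}.$$

Setting the partial derivative with respect to $\sigma^2$ equal to zero,

$$-\frac{n}{2\sigma^2} + \frac{S(\boldsymbol\phi, \boldsymbol\theta)}{2\sigma^4} = 0 \quad\Longrightarrow\quad \sigma^2 = n^{-1} S(\boldsymbol\phi, \boldsymbol\theta),$$

and substituting this value back in,

$$-2 \ln L\bigl(\boldsymbol\phi, \boldsymbol\theta, n^{-1} S(\boldsymbol\phi, \boldsymbol\theta)\bigr) = n \ln(2\pi) + n + n \ln\bigl(n^{-1} S(\boldsymbol\phi, \boldsymbol\theta)\bigr) + \sum_{j=1}^{n} \ln r_{j-1} = n\bigl(\ln(2\pi) + 1\bigr) + n\,\ell(\boldsymbol\phi, \boldsymbol\theta).$$

Since the first term does not depend on the parameters, maximizing the likelihood over $(\boldsymbol\phi, \boldsymbol\theta, \sigma^2)$ amounts to minimizing $\ell(\boldsymbol\phi, \boldsymbol\theta)$ and then setting $\hat\sigma^2 = n^{-1} S(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta})$.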


Least Squares Estimation for Mixed Models 

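The least squares estimates $\tilde{\boldsymbol\phi}$ and $\tilde{\boldsymbol\theta}$ of $\boldsymbol\phi$ and $\boldsymbol\theta$ are obtained by minimizing the function
$S$ as defined in (5.2.11) rather than $\ell$ as defined in (5.2.12), subject to the constraints
that the model be causal and invertible. The least squares estimate of $\sigma^2$ is

$$\tilde\sigma^2 = \frac{S\bigl(\tilde{\boldsymbol\phi}, \tilde{\boldsymbol\theta}\bigr)}{n - p - q}.$$

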
Order Selection 


In Section 5.1 we introduced minimization of the AICC value as a major criterion for 
the selection of the orders p and q. This criterion is applied as follows: 


AICC Criterion: 


Choose $p$, $q$, $\boldsymbol\phi_p$, and $\boldsymbol\theta_q$ to minimize

$$\mathrm{AICC} = -2 \ln L(\boldsymbol\phi_p, \boldsymbol\theta_q, S(\boldsymbol\phi_p, \boldsymbol\theta_q)/n) + 2(p + q + 1)n/(n - p - q - 2).$$

For any fixed $p$ and $q$ it is clear that the AICC is minimized when $\boldsymbol\phi_p$ and $\boldsymbol\theta_q$ are
the vectors that minimize $-2 \ln L(\boldsymbol\phi_p, \boldsymbol\theta_q, S(\boldsymbol\phi_p, \boldsymbol\theta_q)/n)$, i.e., the maximum likelihood
estimators. Final decisions with respect to order selection should therefore be made on
the basis of maximum likelihood estimators (rather than the preliminary estimators of
Section 5.1, which serve primarily as a guide). The AICC statistic and its justification
are discussed in detail in Section 5.5.
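
The penalty term is easy to evaluate by hand or in code. The following sketch uses a hypothetical helper `aicc`; the inputs are the values quoted in this chapter for the maximum likelihood AR(1) model of the differenced Dow Jones series, namely $-2 \ln L = 70.321$ with $n = 77$ observations.

```python
def aicc(neg2_loglik, p, q, n):
    """AICC = -2 ln L + 2 (p + q + 1) n / (n - p - q - 2)."""
    return neg2_loglik + 2.0 * (p + q + 1) * n / (n - p - q - 2)

dowj_ar1_aicc = aicc(70.321, p=1, q=0, n=77)   # approximately 74.483
```

The result agrees with the AICC value reported for that model, and the same function can be used to compare candidate orders once their maximized likelihoods are known.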

One of the options in the program ITSM is Model>Estimation>Autofit. Se- 
lection of this option allows you to specify a range of values for both p and q, after 
which the program will automatically fit maximum likelihood ARMA(p, q) values 
for all p and q in the specified range, and select from these the model with smallest 
AICC value. This may be slow if a large range is selected (the maximum range is from 
0 through 27 for both p and q), and once the model has been determined, it should 
be checked by preliminary estimation followed by maximum likelihood estimation 
to minimize the risk of the fitted model corresponding to a local rather than a global 
maximum of the likelihood. (For more details see Appendix D.3.1.) 


Confidence Regions for the Coefficients 

For large sample size the maximum likelihood estimator $\hat{\boldsymbol\beta}$ of $\boldsymbol\beta := (\phi_1, \ldots, \phi_p,
\theta_1, \ldots, \theta_q)'$ is approximately normally distributed with mean $\boldsymbol\beta$ and covariance matrix
$[n^{-1} V(\boldsymbol\beta)]$, which can be approximated by $2H^{-1}(\boldsymbol\beta)$, where $H$ is the Hessian
matrix $[\partial^2 \ell(\boldsymbol\beta)/\partial\beta_i \partial\beta_j]_{i,j=1}^{p+q}$. ITSM prints out the approximate standard deviations
and correlations of the coefficient estimators based on the Hessian matrix evaluated
numerically at $\hat{\boldsymbol\beta}$ unless this matrix is not positive definite, in which case ITSM instead
computes the theoretical asymptotic covariance matrix in Section 8.8 of TSTM. The
resulting covariances can be used to compute confidence bounds for the parameters.


Large-Sample Distribution of Maximum Likelihood Estimators:


For a large sample from an ARMA($p, q$) process,

$$\hat{\boldsymbol\beta} \approx N\bigl(\boldsymbol\beta, n^{-1} V(\boldsymbol\beta)\bigr).$$


The general form of $V(\boldsymbol\beta)$ can be found in TSTM, Section 8.8. The following are
several special cases.


Example 5.2.1 An AR(p) model


The asymptotic covariance matrix in this case is the same as that for the Yule-Walker
estimates given by

$$V(\boldsymbol\phi) = \sigma^2 \Gamma_p^{-1}.$$

In the special cases $p = 1$ and $p = 2$, we have

$$\text{AR(1)}: \quad V(\phi) = \bigl(1 - \phi_1^2\bigr),$$

$$\text{AR(2)}: \quad V(\boldsymbol\phi) = \begin{bmatrix} 1 - \phi_2^2 & -\phi_1(1 + \phi_2) \\ -\phi_1(1 + \phi_2) & 1 - \phi_2^2 \end{bmatrix}.$$


Example 5.2.2 An MA(q) model


Let $\Gamma_q^*$ be the covariance matrix of $Y_1, \ldots, Y_q$, where $\{Y_t\}$ is the autoregressive process
with autoregressive polynomial $\theta(z)$, i.e.,

$$Y_t + \theta_1 Y_{t-1} + \cdots + \theta_q Y_{t-q} = Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, 1).$$

Then it can be shown that

$$V(\boldsymbol\theta) = \Gamma_q^{*-1}.$$

Inspection of the results of Example 5.2.1 and replacement of $\phi_j$ by $-\theta_j$ yields

$$\text{MA(1)}: \quad V(\theta) = \bigl(1 - \theta_1^2\bigr),$$

$$\text{MA(2)}: \quad V(\boldsymbol\theta) = \begin{bmatrix} 1 - \theta_2^2 & \theta_1(1 - \theta_2) \\ \theta_1(1 - \theta_2) & 1 - \theta_2^2 \end{bmatrix}.$$


Example 5.2.3 An ARMA(1, 1) model


For a causal and invertible ARMA(1,1) process with coefficients $\phi$ and $\theta$,

$$V(\phi, \theta) = \frac{1 + \phi\theta}{(\phi + \theta)^2} \begin{bmatrix} (1 - \phi^2)(1 + \phi\theta) & -(1 - \theta^2)(1 - \phi^2) \\ -(1 - \theta^2)(1 - \phi^2) & (1 - \theta^2)(1 + \phi\theta) \end{bmatrix}.$$
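
These special cases translate directly into approximate standard errors, since the large-sample variance of each estimator is the corresponding diagonal entry of $V$ divided by $n$. The sketch below uses hypothetical function names. The AR(1) check uses $\hat\phi = 0.4471$ and $n = 77$, the values that appear for the Dow Jones series in this section; the ARMA(1,1) inputs $\phi = 0.7$, $\theta = 0.4$ are arbitrary illustrative values.

```python
import numpy as np

def v_ar1(phi):
    """Asymptotic covariance matrix V for an AR(1) coefficient."""
    return np.array([[1.0 - phi ** 2]])

def v_ma1(theta):
    """Asymptotic covariance matrix V for an MA(1) coefficient."""
    return np.array([[1.0 - theta ** 2]])

def v_arma11(phi, theta):
    """Asymptotic covariance matrix V(phi, theta) for a causal, invertible ARMA(1,1)."""
    c = (1.0 + phi * theta) / (phi + theta) ** 2
    off = -(1.0 - theta ** 2) * (1.0 - phi ** 2)
    return c * np.array([[(1.0 - phi ** 2) * (1.0 + phi * theta), off],
                         [off, (1.0 - theta ** 2) * (1.0 + phi * theta)]])

def standard_errors(V, n):
    """Approximate standard deviations sqrt(diag(V) / n)."""
    return np.sqrt(np.diag(V) / n)

se_ar1 = standard_errors(v_ar1(0.4471), 77)      # about 0.1019
V_demo = v_arma11(0.7, 0.4)                      # illustrative parameter values
se_demo = standard_errors(V_demo, 100)
```

Note that the ARMA(1,1) formula breaks down as $\phi + \theta \to 0$, which reflects the fact that the two parameters are then nearly unidentifiable.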


Example 5.2.4 The Dow Jones Utilities Index


For the Burg and Yule-Walker AR(1) models derived for the differenced and mean-corrected
series in Examples 5.1.1 and 5.1.3, the Model>Estimation>Preliminary
option of ITSM gives $-2 \ln(L) = 70.330$ for the Burg model and $-2 \ln(L) = 70.378$
for the Yule-Walker model. Since maximum likelihood estimation attempts to minimize
$-2 \ln L$, the Burg estimate appears to be a slightly better initial estimate of $\phi$.
We therefore retain the Burg AR(1) model and then select Model>Estimation>Max
Likelihood and click OK. The Burg coefficient estimates provide initial parameter
values to start the search for the minimizing values. The model found on completion
of the minimization is

$$Y_t - 0.4471 Y_{t-1} = Z_t, \qquad \{Z_t\} \sim \mathrm{WN}(0, 0.02117). \tag{5.2.13}$$

This model is different again from the Burg and Yule-Walker models. It has
$-2 \ln(L) = 70.321$, corresponding to a slightly higher likelihood. The standard
error (or estimated standard deviation) of the estimator $\hat\phi$ is found from the program
to be 0.1050. This is in good agreement with the estimated standard deviation
$\sqrt{(1 - (.4471)^2)/77} = .1019$, based on the large-sample approximation given in Example
5.2.1. Using the value computed from ITSM, approximate 95% confidence
bounds for $\phi$ are $0.4471 \pm 1.96 \times 0.1050 = (0.2413, 0.6529)$. These are quite close
to the bounds based on the Yule-Walker and Burg estimates found in Examples 5.1.1
and 5.1.3. To find the minimum-AICC model for the series $\{Y_t\}$, choose the option
Model>Estimation>Autofit. Using the default range for both $p$ and $q$, and clicking
on Start, we quickly find that the minimum AICC ARMA($p, q$) model with $p \le 5$
and $q \le 5$ is the AR(1) model defined by (5.2.13). The corresponding AICC value is
74.483. If we increase the upper limits for $p$ and $q$, we obtain the same result.


Example 5.2.5 The lake data


Using the option Model>Estimation>Autofit as in the previous example, we find
that the minimum-AICC ARMA($p, q$) model for the mean-corrected lake data, $X_t =
Y_t - 9.0041$, of Examples 5.1.6 and 5.1.7 is the ARMA(1,1) model

$$X_t - 0.7446 X_{t-1} = Z_t + 0.3213 Z_{t-1}, \qquad \{Z_t\} \sim \mathrm{WN}(0, 0.4750). \tag{5.2.14}$$

The estimated standard deviations of the two coefficient estimates $\hat\phi$ and $\hat\theta$ are found
from ITSM to be 0.0773 and 0.1123, respectively. (The respective estimated standard
deviations based on the large-sample approximation given in Example 5.2.3 are .0788
and .1119.) The corresponding 95% confidence bounds are therefore

$$\phi: \quad 0.7446 \pm 1.96 \times 0.0773 = (0.5931, 0.8961),$$

$$\theta: \quad 0.3208 \pm 1.96 \times 0.1123 = (0.1007, 0.5409).$$

The value of AICC for this model is 212.77, improving on the values for the preliminary
models of Examples 5.1.4, 5.1.6, and 5.1.7.


5.3 Diagnostic Checking 


Typically, the goodness of fit of a statistical model to a set of data is judged by 
comparing the observed values with the corresponding predicted values obtained 
from the fitted model. If the fitted model is appropriate, then the residuals should 
behave in a manner that is consistent with the model. 

When we fit an ARMA($p, q$) model to a given series we determine the maximum
likelihood estimators $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$, and $\hat\sigma^2$ of the parameters $\boldsymbol\phi$, $\boldsymbol\theta$, and $\sigma^2$. In the course of this
procedure the predicted values $\hat X_t(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta})$ of $X_t$ based on $X_1, \ldots, X_{t-1}$ are computed
for the fitted model. The residuals are then defined, in the notation of Section 3.3, by

$$\hat W_t = \bigl(X_t - \hat X_t(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta})\bigr) \big/ \bigl(r_{t-1}(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta})\bigr)^{1/2}, \qquad t = 1, \ldots, n. \tag{5.3.1}$$

If we were to assume that the maximum likelihood ARMA($p, q$) model is the true process
generating $\{X_t\}$, then we could say that $\{\hat W_t\} \sim \mathrm{WN}(0, \hat\sigma^2)$. However, to check
the appropriateness of an ARMA($p, q$) model for the data we should assume only
that $X_1, \ldots, X_n$ are generated by an ARMA($p, q$) process with unknown parameters
$\boldsymbol\phi$, $\boldsymbol\theta$, and $\sigma^2$, whose maximum likelihood estimators are $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$, and $\hat\sigma^2$, respectively.
Then it is not true that $\{\hat W_t\}$ is white noise. Nonetheless $\hat W_t$, $t = 1, \ldots, n$, should
have properties that are similar to those of the white noise sequence

$$W_t(\boldsymbol\phi, \boldsymbol\theta) = \bigl(X_t - \hat X_t(\boldsymbol\phi, \boldsymbol\theta)\bigr) \big/ \bigl(r_{t-1}(\boldsymbol\phi, \boldsymbol\theta)\bigr)^{1/2}, \qquad t = 1, \ldots, n.$$

Moreover, $W_t(\boldsymbol\phi, \boldsymbol\theta)$ approximates the white noise term in the defining equation (5.1.1)
in the sense that $E(W_t(\boldsymbol\phi, \boldsymbol\theta) - Z_t)^2 \to 0$ as $t \to \infty$ (TSTM, Section 8.11). Consequently,
the properties of the residuals $\{\hat W_t\}$ should reflect those of the white noise
sequence $\{Z_t\}$ generating the underlying ARMA($p, q$) process. In particular, the sequence
$\{\hat W_t\}$ should be approximately (i) uncorrelated if $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, (ii)
independent if $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$, and (iii) normally distributed if $Z_t \sim N(0, \sigma^2)$.

The rescaled residuals $\hat R_t$, $t = 1, \ldots, n$, are obtained by dividing the residuals
$\hat W_t$, $t = 1, \ldots, n$, by the estimate $\hat\sigma = \sqrt{\bigl(\sum_{t=1}^{n} \hat W_t^2\bigr)/n}$ of the white noise standard
deviation. Thus,

$$\hat R_t = \hat W_t / \hat\sigma. \tag{5.3.2}$$

Figure 5-5. The rescaled residuals after fitting the ARMA(1,1) model of Example 5.2.5 to the lake data.


If the fitted model is appropriate, the rescaled residuals should have properties similar
to those of a WN(0, 1) sequence or of an IID(0, 1) sequence if we make the stronger
assumption that the white noise $\{Z_t\}$ driving the ARMA process is independent white
noise.

The following diagnostic checks are all based on the expected properties of the
residuals or rescaled residuals under the assumption that the fitted model is correct
and that $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$. They are the same tests introduced in Section 1.6.


5.3.1 The Graph of $\{\hat R_t, t = 1, \ldots, n\}$


If the fitted model is appropriate, then the graph of the rescaled residuals $\{\hat R_t, t =
1, \ldots, n\}$ should resemble that of a white noise sequence with variance one. While it is
difficult to identify the correlation structure of $\{\hat R_t\}$ (or any time series for that matter)
from its graph, deviations of the mean from zero are sometimes clearly indicated by
a trend or cyclic component and nonconstancy of the variance by fluctuations in $\hat R_t$,
whose magnitude depends strongly on $t$.

The rescaled residuals obtained from the ARMA(1,1) model fitted to the mean-corrected
lake data in Example 5.2.5 are displayed in Figure 5.5. The graph gives no
indication of a nonzero mean or nonconstant variance, so on this basis there is no
reason to doubt the compatibility of $\hat R_1, \ldots, \hat R_n$ with unit-variance white noise.

The next step is to check that the sample autocorrelation function of $\{\hat W_t\}$ (or
equivalently of $\{\hat R_t\}$) behaves as it should under the assumption that the fitted model
is appropriate.

Figure 5-6. The sample ACF of the residuals after fitting the ARMA(1,1) model of Example 5.2.5 to the lake data.


5.3.2 The Sample ACF of the Residuals


We know from Section 1.6 that for large $n$ the sample autocorrelations of an iid sequence
$Y_1, \ldots, Y_n$ with finite variance are approximately iid with distribution
$N(0, 1/n)$. We can therefore test whether or not the observed residuals are consistent
with iid noise by examining the sample autocorrelations of the residuals and
rejecting the iid noise hypothesis if more than two or three out of 40 fall outside the
bounds $\pm 1.96/\sqrt{n}$ or if one falls far outside the bounds. (As indicated above, our
estimated residuals will not be precisely iid even if the true model generating the
data is as assumed. To correct for this the bounds $\pm 1.96/\sqrt{n}$ should be modified to
give a more precise test as in Box and Pierce (1970) and TSTM, Section 9.4.) The
sample ACF and PACF of the residuals and the bounds $\pm 1.96/\sqrt{n}$ can be viewed
by pressing the second green button (Plot ACF/PACF of residuals) at the top of
the ITSM window. Figure 5.6 shows the sample ACF of the residuals after fitting the
ARMA(1,1) of Example 5.2.5 to the lake data. As can be seen from the graph, there
is no cause to reject the fitted model on the basis of these autocorrelations.
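
The rescaling (5.3.2) and the count of autocorrelations outside $\pm 1.96/\sqrt{n}$ can be sketched as follows. The function names are hypothetical, and the check uses the unmodified bounds, so it corresponds to the rough rule described above rather than to the refined Box-Pierce version. The demonstration input is a deliberately non-random alternating sequence, for which every one of the first 40 sample autocorrelations lies far outside the bounds; residuals from an adequate model should produce only a handful of exceedances.

```python
import numpy as np

def rescale(w):
    """Rescaled residuals R_t = W_t / sigma_hat, with sigma_hat^2 = sum(W_t^2) / n."""
    w = np.asarray(w, dtype=float)
    return w / np.sqrt(np.mean(w ** 2))

def acf_exceedances(r, max_lag=40):
    """Number of sample autocorrelations at lags 1..max_lag outside +-1.96/sqrt(n)."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    d = r - r.mean()
    gamma0 = np.dot(d, d) / n
    rho = np.array([np.dot(d[: n - h], d[h:]) / n / gamma0
                    for h in range(1, max_lag + 1)])
    bound = 1.96 / np.sqrt(n)
    return int(np.sum(np.abs(rho) > bound)), bound

# A clearly non-random "residual" series (illustrative, not from any fitted model).
w_demo = np.array([2.0, -2.0] * 50)
r_demo = rescale(w_demo)
count_demo, bound_demo = acf_exceedances(r_demo)
```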


5.3.3 Tests for Randomness of the Residuals 


The tests (b), (c), (d), (e), and (f) of Section 1.6 can be carried out using the pro- 
gram ITSM by selecting Statistics>Residual Analysis>Tests of Random- 
ness. 

Applying these tests to the residuals from the ARMA(1,1) model for the mean- 
corrected lake data (Example 5.2.5), and using the default value h = 22 suggested 
for the portmanteau tests, we obtain the following results:


RANDOMNESS TEST STATISTICS
LJUNG-BOX PORTM. = 10.23 CHISQUR(20) p=.964
MCLEOD-LI PORTM. = 16.55 CHISQUR(22) p=.788
TURNING POINTS = 69 ANORMAL(64.0, 4.14**2) p=.227
DIFFERENCE-SIGN = 50 ANORMAL(48.5, 2.87**2) p=.602
RANK TEST = 2083 ANORMAL(2376, 488.7**2) p=.072
JARQUE-BERA = .285 CHISQUR(2) p=.867

ORDER OF MIN AICC YW MODEL FOR RESIDUALS = 0


This table shows the observed values of the statistics defined in Section 1.6, with each
followed by its large-sample distribution under the null hypothesis of iid residuals,
and the corresponding p-values. The observed values can thus be checked easily
for compatibility with their distributions under the null hypothesis. Since all of the
p-values are greater than .05, none of the test statistics leads us to reject the null
hypothesis at this level. The order of the minimum AICC autoregressive model for
the residuals also suggests the compatibility of the residuals with white noise.

A rough check for normality is provided by visual inspection of the histogram
of the rescaled residuals, obtained by selecting the third green button at the top of
the ITSM window. A Gaussian qq-plot of the residuals can also be plotted by selecting
Statistics>Residual Analysis>QQ-Plot (normal). No obvious deviation from
normality is apparent in either the histogram or the qq-plot. The Jarque-Bera statistic,
$n\left[m_3^2/(6m_2^3) + (m_4/m_2^2 - 3)^2/24\right]$, where $m_r = \sum_{j=1}^{n}(Y_j - \bar Y)^r/n$, is distributed
asymptotically as $\chi^2(2)$ if $\{Y_t\} \sim \mathrm{IID\ N}(\mu, \sigma^2)$. This hypothesis is rejected if the
statistic is sufficiently large (at level $\alpha$ if the p-value of the test is less than $\alpha$). In
this case the large p-value computed by ITSM provides no evidence for rejecting the
normality hypothesis.


5.4 Forecasting


Once a model has been fitted to the data, forecasting future values of the time series
can be carried out using the method described in Section 3.3. We illustrate this method
with one of the examples from Section 3.2.


Example 5.4.1

For the overshort data $\{X_t\}$ of Example 3.2.8, selection of the ITSM option
Model>Estimation>Preliminary and the innovations algorithm, followed by
Model>Estimation>Max likelihood, leads to the maximum likelihood MA(1) model for
$\{X_t\}$,


$$X_t + 4.035 = Z_t - .818 Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 2040.75). \tag{5.4.1}$$


To predict the next 7 days of overshorts, we treat (5.4.1) as the true model for the
data, and use the results of Example 3.3.3 with $\phi = 0$. From (3.3.11), the predictors
are given by


$$P_{57}X_{57+h} = -4.035 + \sum_{j=h}^{1} \theta_{57+h-1,j}\left(X_{57+h-j} - \hat X_{57+h-j}\right) = \begin{cases} -4.035 + \theta_{57,1}\left(X_{57} - \hat X_{57}\right), & \text{if } h = 1, \\ -4.035, & \text{if } h > 1, \end{cases}$$


with mean squared error


$$E\left(X_{57+h} - P_{57}X_{57+h}\right)^2 = \begin{cases} 2040.75\, r_{57}, & \text{if } h = 1, \\ 2040.75\left(1 + (-.818)^2\right), & \text{if } h > 1, \end{cases}$$


where $\theta_{57,1}$ and $r_{57}$ are computed recursively from (3.3.9) with $\theta = -.818$.

These calculations are performed with ITSM by fitting the maximum likelihood 
model (5.4.1), selecting Forecasting>ARMA, and specifying the number of forecasts 
required. The 1-step, 2-step, ..., and 7-step forecasts of $X_t$ are shown in Table 5.1.
Notice that the predictor of $X_t$ for $t \ge 59$ is equal to the sample mean, since under
the MA(1) model $\{X_t, t \ge 59\}$ is uncorrelated with $\{X_t, t \le 57\}$.


Assuming that the innovations $\{Z_t\}$ are normally distributed, an approximate 95%
prediction interval for $X_{64}$ is given by


$$-4.0351 \pm 1.96 \times 58.3602 = (-118.42,\ 110.35).$$
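
The forecast standard errors and the interval above can be checked numerically. The following is a minimal sketch, not ITSM's own computation: the function names are hypothetical, and the recursion $r_0 = 1 + \theta^2$, $r_k = 1 + \theta^2 - \theta^2/r_{k-1}$ is assumed to be the form that the innovations recursion takes for an MA(1) process. It uses the rounded parameters printed in (5.4.1), so the results agree with Table 5.1 only to about two decimal places.

```python
import math


def ma1_r(theta, n):
    """Return r_n for an MA(1) process.

    Assumed recursion (MA(1) case of the innovations algorithm):
    r_0 = 1 + theta^2, r_k = 1 + theta^2 - theta^2 / r_{k-1}.
    """
    r = 1.0 + theta ** 2
    for _ in range(n):
        r = 1.0 + theta ** 2 - theta ** 2 / r
    return r


def ma1_forecast_rmse(theta, sigma2, n, h):
    """Root mean squared error of the h-step predictor based on n observations."""
    if h == 1:
        return math.sqrt(sigma2 * ma1_r(theta, n))
    # For h > 1 the predictor is the mean, so the error variance is the process variance.
    return math.sqrt(sigma2 * (1.0 + theta ** 2))


def prediction_interval(center, rmse, z=1.96):
    """Approximate 95% interval under Gaussian innovations (z = 1.96)."""
    half_width = z * rmse
    return center - half_width, center + half_width


# Rounded values taken from model (5.4.1).
theta, sigma2, mean = -0.818, 2040.75, -4.035
rmse_1 = ma1_forecast_rmse(theta, sigma2, n=57, h=1)
rmse_7 = ma1_forecast_rmse(theta, sigma2, n=57, h=7)
lower, upper = prediction_interval(mean, rmse_7)
print(round(rmse_1, 2), round(rmse_7, 2), round(lower, 2), round(upper, 2))
```

Because $|\theta| < 1$, $r_n$ converges quickly to 1, which is why the one-step standard error is essentially $\sqrt{2040.75}$.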


Table 5.1 Forecasts of the next 7 observations of the overshort data of Example
3.2.8 using model (5.4.1).

 #     XHAT      SQRT(MSE)    XHAT + MEAN
58    1.0097     45.1753       -3.0254
59    0.0000     58.3602       -4.0351
60    0.0000     58.3602       -4.0351
61    0.0000     58.3602       -4.0351
62    0.0000     58.3602       -4.0351
63    0.0000     58.3602       -4.0351
64    0.0000     58.3602       -4.0351


The mean squared errors of prediction, as computed in Section 3.3 and the example
above, are based on the assumption that the fitted model is in fact the true model
for the data. As a result, they do not reflect the variability in the estimation of the
model parameters. To illustrate this point, suppose the data $X_1, \ldots, X_n$ are generated
from the causal AR(1) model


$$X_t = \phi X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{iid}(0, \sigma^2).$$


If $\hat\phi$ is the maximum likelihood estimate of $\phi$, based on $X_1, \ldots, X_n$, then the one-step
ahead forecast of $X_{n+1}$ is $\hat\phi X_n$, which has mean squared error


$$E\left(X_{n+1} - \hat\phi X_n\right)^2 = E\left((\phi - \hat\phi)X_n + Z_{n+1}\right)^2 = E\left((\phi - \hat\phi)X_n\right)^2 + \sigma^2. \tag{5.4.2}$$


The second equality follows from the independence of $Z_{n+1}$ and $(\hat\phi, X_n)'$. To evaluate
the first term in (5.4.2), first condition on $X_n$ and then use the approximations


$$E\left((\phi - \hat\phi)^2 \mid X_n\right) \approx E(\phi - \hat\phi)^2 \approx (1 - \phi^2)/n,$$


where the second relation comes from the formula for the asymptotic variance of $\hat\phi$
given by $\sigma^2\Gamma_1^{-1} = (1 - \phi^2)$ (see Example 5.2.1). The one-step mean squared error
is then approximated by


$$E(\phi - \hat\phi)^2\, E X_n^2 + \sigma^2 \approx n^{-1}(1 - \phi^2)(1 - \phi^2)^{-1}\sigma^2 + \sigma^2 = \frac{n+1}{n}\,\sigma^2.$$


Thus, the error in parameter estimation contributes the term $\sigma^2/n$ to the mean squared
error of prediction. If the sample size is large, this term is negligible, and so for the
purpose of mean squared error computation, the estimated parameters can be treated
as the true model parameters. On the other hand, for small sample sizes, ignoring
parameter variability can lead to a severe underestimate of the actual mean squared
error of the forecast.
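
To get a feel for the size of the extra term, the approximation $\sigma^2(n+1)/n$ can be tabulated for a few sample sizes. This is only an illustrative sketch with hypothetical function names; it evaluates the large-sample approximation derived above and does not attempt to capture small-sample behavior exactly.

```python
def one_step_mse_with_estimation(sigma2, n):
    """Approximate one-step MSE for a fitted AR(1): sigma^2 * (n + 1) / n."""
    return sigma2 * (n + 1) / n


def relative_underestimate(n):
    """Fraction of the approximate MSE that is missed when sigma^2 alone is reported."""
    # (sigma^2 (n + 1) / n - sigma^2) / (sigma^2 (n + 1) / n) = 1 / (n + 1)
    return 1.0 / (n + 1)


for n in (10, 50, 200):
    print(n, one_step_mse_with_estimation(1.0, n), relative_underestimate(n))
```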


5.5 Order Selection 


Once the data have been transformed (e.g., by some combination of Box—Cox and 
differencing transformations or by removal of trend and seasonal components) to 
the point where the transformed series $\{X_t\}$ can potentially be fitted by a zero-mean
ARMA model, we are faced with the problem of selecting appropriate values for the 
orders p and q. 

It is not advantageous from a forecasting point of view to choose p and q arbi- 
trarily large. Fitting a very high order model will generally result in a small estimated 
white noise variance, but when the fitted model is used for forecasting, the mean 
squared error of the forecasts will depend not only on the white noise variance of the 
fitted model but also on errors arising from estimation of the parameters of the model 
(see the paragraphs following Example 5.4.1). These will be larger for higher-order 
models. For this reason we need to introduce a “penalty factor” to discourage the 
fitting of models with too many parameters. 

Many criteria based on such penalty factors have been proposed in the literature, 
since the problem of model selection arises frequently in statistics, particularly in 
regression analysis. We shall restrict attention here to a brief discussion of the FPE,
AIC, and BIC criteria of Akaike and a bias-corrected version of the AIC known as
the AICC.


5.5.1 The FPE Criterion 


The FPE criterion was developed by Akaike (1969) to select the appropriate order
of an AR process to fit to a time series $\{X_1, \ldots, X_n\}$. Instead of trying to choose the
order $p$ to make the estimated white noise variance as small as possible, the idea is to
choose the model for $\{X_t\}$ in such a way as to minimize the one-step mean squared
error when the model fitted to $\{X_t\}$ is used to predict an independent realization $\{Y_t\}$
of the same process that generated $\{X_t\}$.


Suppose then that $\{X_1, \ldots, X_n\}$ is a realization of an AR($p$) process with coefficients
$\phi_1, \ldots, \phi_p$, $p < n$, and that $\{Y_1, \ldots, Y_n\}$ is an independent realization of the
same process. If $\hat\phi_1, \ldots, \hat\phi_p$ are the maximum likelihood estimators of the coefficients
based on $\{X_1, \ldots, X_n\}$ and if we use these to compute the one-step predictor
$\hat\phi_1 Y_n + \cdots + \hat\phi_p Y_{n+1-p}$ of $Y_{n+1}$, then the mean square prediction error is

$$\begin{aligned} E\left(Y_{n+1} - \hat\phi_1 Y_n - \cdots - \hat\phi_p Y_{n+1-p}\right)^2 &= E\left[Y_{n+1} - \phi_1 Y_n - \cdots - \phi_p Y_{n+1-p} - (\hat\phi_1 - \phi_1)Y_n - \cdots - (\hat\phi_p - \phi_p)Y_{n+1-p}\right]^2 \\ &= \sigma^2 + E\left[(\hat{\boldsymbol\phi}_p - \boldsymbol\phi_p)' \left[Y_{n+1-i}Y_{n+1-j}\right]_{i,j=1}^{p} (\hat{\boldsymbol\phi}_p - \boldsymbol\phi_p)\right], \end{aligned}$$

where $\boldsymbol\phi_p = (\phi_1, \ldots, \phi_p)'$, $\hat{\boldsymbol\phi}_p = (\hat\phi_1, \ldots, \hat\phi_p)'$, and $\sigma^2$ is the white noise variance
of the AR($p$) model. Writing the last term in the preceding equation as the expectation
of the conditional expectation given $X_1, \ldots, X_n$, and using the independence of
$\{X_1, \ldots, X_n\}$ and $\{Y_1, \ldots, Y_n\}$, we obtain

$$E\left(Y_{n+1} - \hat\phi_1 Y_n - \cdots - \hat\phi_p Y_{n+1-p}\right)^2 = \sigma^2 + E\left[(\hat{\boldsymbol\phi}_p - \boldsymbol\phi_p)' \Gamma_p (\hat{\boldsymbol\phi}_p - \boldsymbol\phi_p)\right],$$

where $\Gamma_p = E[Y_iY_j]_{i,j=1}^{p}$. We can approximate the last term by assuming that
$n^{1/2}(\hat{\boldsymbol\phi}_p - \boldsymbol\phi_p)$ has its large-sample distribution $\mathrm{N}(\mathbf{0}, \sigma^2\Gamma_p^{-1})$ from Example 5.2.1.
Using Problem 5.13, this gives

$$E\left(Y_{n+1} - \hat\phi_1 Y_n - \cdots - \hat\phi_p Y_{n+1-p}\right)^2 \approx \sigma^2\left(1 + \frac{p}{n}\right). \tag{5.5.1}$$

If $\hat\sigma^2$ is the maximum likelihood estimator of $\sigma^2$, then for large $n$, $n\hat\sigma^2/\sigma^2$ is distributed
approximately as chi-squared with $(n - p)$ degrees of freedom (see TSTM, Section
8.9). We therefore replace $\sigma^2$ in (5.5.1) by the estimator $n\hat\sigma^2/(n - p)$ to get the
estimated mean square prediction error of $Y_{n+1}$,

$$\mathrm{FPE}_p = \hat\sigma^2\, \frac{n + p}{n - p}. \tag{5.5.2}$$

To apply the FPE criterion for autoregressive order selection we therefore choose
the value of $p$ that minimizes $\mathrm{FPE}_p$ as defined in (5.5.2).


Table 5.2 $\hat\sigma_p^2$ and $\mathrm{FPE}_p$ for AR($p$) models fitted to the lake data.

 p     $\hat\sigma_p^2$     $\mathrm{FPE}_p$
 0     1.7203     1.7203
 1     0.5097     0.5202
 2     0.4790     0.4989
 3     0.4728     0.5027
 4     0.4708     0.5109
 5     0.4705     0.5211
 6     0.4705     0.5318
 7     0.4679     0.5399
 8     0.4664     0.5493
 9     0.4664     0.5607
10     0.4453     0.5465


Example 5.5.1 FPE-based selection of an AR model for the lake data


In Example 5.1.4 we fitted AR(2) models to the mean-corrected lake data, the order 2 
being suggested by the sample PACF shown in Figure 5.4. To use the FPE criterion to 
select $p$, we have shown in Table 5.2 the values of $\mathrm{FPE}_p$ for values of $p$ from 0 to 10.
These values were found using ITSM by fitting maximum likelihood AR models with
the option Model>Estimation>Max likelihood. Also shown in the table are the
values of the maximum likelihood estimates of $\sigma^2$ for the same values of $p$. Whereas
$\hat\sigma_p^2$ decreases steadily with $p$, the values of $\mathrm{FPE}_p$ have a clear minimum at $p = 2$,
confirming our earlier choice of $p = 2$ as the most appropriate for this data set.
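
The FPE column of Table 5.2 can be recomputed from the variance estimates with formula (5.5.2). The sketch below assumes a sample size of $n = 98$ for the lake data; that value is not stated in this section and is used here because it reproduces the tabulated values to about three decimal places. The function and variable names are illustrative.

```python
def fpe(sigma2_hat, n, p):
    """Final prediction error (5.5.2): sigma2_hat * (n + p) / (n - p)."""
    return sigma2_hat * (n + p) / (n - p)


N_OBS = 98  # assumed sample size of the lake data (not stated in this section)

# Maximum likelihood estimates of the white noise variance for p = 0, ..., 10 (Table 5.2).
SIGMA2_HAT = [1.7203, 0.5097, 0.4790, 0.4728, 0.4708, 0.4705,
              0.4705, 0.4679, 0.4664, 0.4664, 0.4453]

fpe_values = [fpe(s, N_OBS, p) for p, s in enumerate(SIGMA2_HAT)]
best_p = min(range(len(fpe_values)), key=fpe_values.__getitem__)

for p, value in enumerate(fpe_values):
    print(p, round(value, 4))
print("order minimizing FPE:", best_p)
```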


5.5.2 The AICC Criterion 


A more generally applicable criterion for model selection than the FPE is the infor- 
mation criterion of Akaike (1973), known as the AIC. This was designed to be an 
approximately unbiased estimate of the Kullback—Leibler index of the fitted model 
relative to the true model (defined below). Here we use a bias-corrected version of 
the AIC, referred to as the AICC, suggested by Hurvich and Tsai (1989). 

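If $\mathbf{X}$ is an $n$-dimensional random vector whose probability density belongs to
the family $\{f(\cdot\,; \psi), \psi \in \Psi\}$, the Kullback-Leibler discrepancy between $f(\cdot\,; \psi)$ and
$f(\cdot\,; \theta)$ is defined as


$$d(\psi|\theta) = \Delta(\psi|\theta) - \Delta(\theta|\theta),$$

where

$$\Delta(\psi|\theta) = E_\theta\left(-2\ln f(\mathbf{X}; \psi)\right) = \int_{\mathbb{R}^n} -2\ln\left(f(\mathbf{x}; \psi)\right) f(\mathbf{x}; \theta)\, d\mathbf{x}$$

is the Kullback-Leibler index of $f(\cdot\,; \psi)$ relative to $f(\cdot\,; \theta)$. (Note that in general,
$\Delta(\psi|\theta) \ne \Delta(\theta|\psi)$.) By Jensen's inequality (see, e.g., Mood et al., 1974),

$$\begin{aligned} d(\psi|\theta) &= \int_{\mathbb{R}^n} -2\ln\left(\frac{f(\mathbf{x}; \psi)}{f(\mathbf{x}; \theta)}\right) f(\mathbf{x}; \theta)\, d\mathbf{x} \\ &\ge -2\ln\left(\int_{\mathbb{R}^n} \frac{f(\mathbf{x}; \psi)}{f(\mathbf{x}; \theta)}\, f(\mathbf{x}; \theta)\, d\mathbf{x}\right) \\ &= -2\ln\left(\int_{\mathbb{R}^n} f(\mathbf{x}; \psi)\, d\mathbf{x}\right) \\ &= 0, \end{aligned}$$

with equality holding if and only if $f(\mathbf{x}; \psi) = f(\mathbf{x}; \theta)$.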
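
As a small illustration of these definitions, the index and discrepancy have closed forms when both densities are univariate normal ($n = 1$). This special case is not treated in the text; it is chosen here only because the expectation can be written down directly, and the function names are hypothetical.

```python
import math


def kl_index(mu_psi, var_psi, mu_theta, var_theta):
    """Delta(psi|theta) = E_theta[-2 ln f(X; psi)] for univariate normal densities."""
    # E_theta (X - mu_psi)^2 = var_theta + (mu_theta - mu_psi)^2
    expected_square = var_theta + (mu_theta - mu_psi) ** 2
    return math.log(2.0 * math.pi * var_psi) + expected_square / var_psi


def kl_discrepancy(mu_psi, var_psi, mu_theta, var_theta):
    """d(psi|theta) = Delta(psi|theta) - Delta(theta|theta)."""
    return (kl_index(mu_psi, var_psi, mu_theta, var_theta)
            - kl_index(mu_theta, var_theta, mu_theta, var_theta))


# True model theta = N(0, 1); candidate psi = N(1, 2).
d = kl_discrepancy(1.0, 2.0, 0.0, 1.0)
index_psi_given_theta = kl_index(1.0, 2.0, 0.0, 1.0)
index_theta_given_psi = kl_index(0.0, 1.0, 1.0, 2.0)
print(d, index_psi_given_theta, index_theta_given_psi)
```

The discrepancy is zero when the candidate equals the true density and positive otherwise, and the two indices differ, in line with the remark that $\Delta(\psi|\theta) \ne \Delta(\theta|\psi)$ in general.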

Given observations $X_1, \ldots, X_n$ of an ARMA process with unknown parameters
$\theta = (\boldsymbol\beta, \sigma^2)$, the true model could be identified if it were possible to compute the
Kullback-Leibler discrepancy between all candidate models and the true model. Since
this is not possible, we estimate the Kullback-Leibler discrepancies and choose the
model whose estimated discrepancy (or index) is minimum. In order to do this,
we assume that the true model and the alternatives are all Gaussian. Then for any
given $\theta = (\boldsymbol\beta, \sigma^2)$, $f(\cdot\,; \theta)$ is the probability density of $(Y_1, \ldots, Y_n)'$, where $\{Y_t\}$ is
a Gaussian ARMA($p$, $q$) process with coefficient vector $\boldsymbol\beta$ and white noise variance
$\sigma^2$. (The dependence of $\theta$ on $p$ and $q$ is through the dimension of the autoregressive
and moving-average coefficients in $\boldsymbol\beta$.)

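Suppose, therefore, that our observations $X_1, \ldots, X_n$ are from a Gaussian ARMA
process with parameter vector $\theta = (\boldsymbol\beta, \sigma^2)$ and assume for the moment that the true
order is $(p, q)$. Let $\hat\theta = (\hat{\boldsymbol\beta}, \hat\sigma^2)$ be the maximum likelihood estimator of $\theta$ based on
$X_1, \ldots, X_n$ and let $Y_1, \ldots, Y_n$ be an independent realization of the true process (with
parameter $\theta$). Then


$$-2\ln L_Y(\hat{\boldsymbol\beta}, \hat\sigma^2) = -2\ln L_X(\hat{\boldsymbol\beta}, \hat\sigma^2) + \hat\sigma^{-2} S_Y(\hat{\boldsymbol\beta}) - n,$$


where $L_X$, $L_Y$, $S_X$, and $S_Y$ are defined as in (5.2.9) and (5.2.11). Hence,


$$\begin{aligned} E_\theta\left(\Delta(\hat\theta|\theta)\right) &= E_{\boldsymbol\beta,\sigma^2}\left(-2\ln L_Y(\hat{\boldsymbol\beta}, \hat\sigma^2)\right) \\ &= E_{\boldsymbol\beta,\sigma^2}\left(-2\ln L_X(\hat{\boldsymbol\beta}, \hat\sigma^2)\right) + E_{\boldsymbol\beta,\sigma^2}\left(\frac{S_Y(\hat{\boldsymbol\beta})}{\hat\sigma^2}\right) - n. \end{aligned} \tag{5.5.3}$$

It can be shown using large-sample approximations (see TSTM, Section 9.3 for
details) that


$$E_{\boldsymbol\beta,\sigma^2}\left(\frac{S_Y(\hat{\boldsymbol\beta})}{\hat\sigma^2}\right) - n \approx \frac{2(p + q + 1)n}{n - p - q - 2},$$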

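from which we see that $-2\ln L_X(\hat{\boldsymbol\beta}, \hat\sigma^2) + 2(p + q + 1)n/(n - p - q - 2)$ is an approximately
unbiased estimator of the expected Kullback-Leibler index $E_\theta(\Delta(\hat\theta|\theta))$
in (5.5.3). Since the preceding calculations (and the maximum likelihood estimators
$\hat{\boldsymbol\beta}$ and $\hat\sigma^2$) are based on the assumption that the true order is $(p, q)$, we therefore select
the values of $p$ and $q$ for our fitted model to be those that minimize $\mathrm{AICC}(\hat{\boldsymbol\beta})$, where


$$\mathrm{AICC}(\boldsymbol\beta) := -2\ln L_X(\boldsymbol\beta, S_X(\boldsymbol\beta)/n) + 2(p + q + 1)n/(n - p - q - 2). \tag{5.5.4}$$

The AIC statistic, defined as

$$\mathrm{AIC}(\boldsymbol\beta) := -2\ln L_X(\boldsymbol\beta, S_X(\boldsymbol\beta)/n) + 2(p + q + 1),$$

can be used in the same way. Both $\mathrm{AICC}(\boldsymbol\beta, \sigma^2)$ and $\mathrm{AIC}(\boldsymbol\beta, \sigma^2)$ can be defined
for arbitrary $\sigma^2$ by replacing $S_X(\boldsymbol\beta)/n$ in the preceding definitions by $\sigma^2$. The value
$S_X(\boldsymbol\beta)/n$ is used in (5.5.4), since $\mathrm{AICC}(\boldsymbol\beta, \sigma^2)$ (like $\mathrm{AIC}(\boldsymbol\beta, \sigma^2)$) is minimized for
any given $\boldsymbol\beta$ by setting $\sigma^2 = S_X(\boldsymbol\beta)/n$.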
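
The two statistics differ only in their penalty terms, which the following sketch makes explicit. It assumes that the value of $-2\ln L_X$ has already been obtained from some fitting routine (none is provided here), and the function names are illustrative.

```python
def aic(neg2_log_lik, p, q):
    """AIC: -2 ln L plus the penalty 2(p + q + 1)."""
    return neg2_log_lik + 2 * (p + q + 1)


def aicc(neg2_log_lik, p, q, n):
    """AICC (5.5.4): -2 ln L plus the penalty 2(p + q + 1) n / (n - p - q - 2)."""
    denominator = n - p - q - 2
    if denominator <= 0:
        raise ValueError("AICC requires n > p + q + 2")
    return neg2_log_lik + 2 * (p + q + 1) * n / denominator


# Penalty terms alone (neg2_log_lik = 0) for a series of length 98, an assumed value.
for p, q in [(1, 1), (5, 5), (10, 10)]:
    print((p, q), aic(0.0, p, q), round(aicc(0.0, p, q, 98), 3))
```

For fixed $p$ and $q$ the AICC penalty exceeds the AIC penalty and approaches it as $n$ grows, while for fixed $n$ the gap widens with the model order.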

For fitting autoregressive models, Monte Carlo studies (Jones, 1975; Shibata,
1976) suggest that the AIC has a tendency to overestimate $p$. The penalty factors
$2(p + q + 1)n/(n - p - q - 2)$ and $2(p + q + 1)$ for the AICC and AIC statistics
are asymptotically equivalent as $n \to \infty$. The AICC statistic, however, has a more
extreme penalty for large-order models, which counteracts the overfitting tendency
of the AIC. The BIC is another criterion that attempts to correct the overfitting nature
of the AIC. For a zero-mean causal invertible ARMA($p$, $q$) process, it is defined
(Akaike, 1978) to be


$$\mathrm{BIC} = (n - p - q)\ln\left[\frac{n\hat\sigma^2}{n - p - q}\right] + n\left(1 + \ln\sqrt{2\pi}\right) + (p + q)\ln\left[\frac{\sum_{t=1}^{n} X_t^2 - n\hat\sigma^2}{p + q}\right], \tag{5.5.5}$$


where $\hat\sigma^2$ is the maximum likelihood estimate of the white noise variance.

The BIC is a consistent order-selection criterion in the sense that if the data
$\{X_1, \ldots, X_n\}$ are in fact observations of an ARMA($p$, $q$) process, and if $\hat p$ and $\hat q$ are
the estimated orders found by minimizing the BIC, then $\hat p \to p$ and $\hat q \to q$ with
probability 1 as $n \to \infty$ (Hannan, 1980). This property is not shared by the AICC or
AIC. On the other hand, order selection by minimization of the AICC, AIC, or FPE
is asymptotically efficient for autoregressive processes, while order selection by BIC
minimization is not (Shibata, 1980; Hurvich and Tsai, 1989). Efficiency is a desirable
property defined in terms of the one-step mean square prediction error achieved by
the fitted model. For more details see TSTM, Section 9.3.

In the modeling of real data there is rarely such a thing as the "true order." For the
process $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ there may be many polynomials $\theta(z)$, $\phi(z)$ such that the
coefficients of $z^j$ in $\theta(z)/\phi(z)$ closely approximate $\psi_j$ for moderately small values
of $j$. Correspondingly, there may be many ARMA processes with properties similar
to $\{X_t\}$. This problem of identifiability becomes much more serious for multivariate
processes. The AICC criterion does, however, provide us with a rational criterion for
choosing among competing models. It has been suggested (Duong, 1984) that models
with AIC values within $c$ of the minimum value should be considered competitive
(with $c = 2$ as a typical value). Selection from among the competitive models can
then be based on such factors as whiteness of the residuals (Section 5.3) and model
simplicity.

We frequently have occasion, particularly in analyzing seasonal data, to fit
ARMA($p$, $q$) models in which all except $m$ ($\le p + q$) of the coefficients are constrained
to be zero. In such cases the definition (5.5.4) is replaced by


$$\mathrm{AICC}(\boldsymbol\beta) := -2\ln L_X(\boldsymbol\beta, S_X(\boldsymbol\beta)/n) + 2(m + 1)n/(n - m - 2). \tag{5.5.6}$$


Example 5.5.2 Models for the lake data


In Example 5.2.4 we found that the minimum-AICC ARMA($p$, $q$) model for
the mean-corrected lake data is the ARMA(1,1) model (5.2.14). For this model
ITSM gives the values AICC = 212.77 and BIC = 216.86. A systematic check
on ARMA($p$, $q$) models for other values of $p$ and $q$ shows that the model (5.2.14)
also minimizes the BIC statistic. The minimum-AICC AR($p$) model is found to be
the AR(2) model satisfying


$$X_t - 1.0441X_{t-1} + .2503X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4789),$$


with AICC = 213.54 and BIC = 217.63. Both the AR(2) and ARMA(1,1) models
pass the diagnostic checks of Section 5.3, and in view of the small difference between
the AICC values there is no strong reason to prefer one model or the other.


Problems

5.1. The sunspot numbers $\{X_t, t = 1, \ldots, 100\}$, filed as SUNSPOTS.TSM, have
sample autocovariances $\hat\gamma(0) = 1382.2$, $\hat\gamma(1) = 1114.4$, $\hat\gamma(2) = 591.73$, and
$\hat\gamma(3) = 96.216$. Use these values to find the Yule-Walker estimates of $\phi_1$, $\phi_2$,
and $\sigma^2$ in the model

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

for the mean-corrected series $Y_t = X_t - 46.93$, $t = 1, \ldots, 100$. Assuming
that the data really are a realization of an AR(2) process, find 95% confidence
intervals for $\phi_1$ and $\phi_2$.


5.2. From the information given in the previous problem, use the Durbin-Levinson
algorithm to compute the sample partial autocorrelations $\hat\phi_{11}$, $\hat\phi_{22}$, and $\hat\phi_{33}$ of
the sunspot series. Is the value of $\hat\phi_{33}$ compatible with the hypothesis that the
data are generated by an AR(2) process? (Use significance level .05.)


5.3. Consider the AR(2) process $\{X_t\}$ satisfying

$$X_t - \phi X_{t-1} - \phi^2 X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

a. For what values of $\phi$ is this a causal process?

b. The following sample moments were computed after observing $X_1, \ldots, X_{200}$:
$\hat\gamma(0) = 6.06$, $\hat\rho(1) = .687$.
Find estimates of $\phi$ and $\sigma^2$ by solving the Yule-Walker equations. (If you
find more than one solution, choose the one that is causal.)


5.4. Two hundred observations of a time series, $X_1, \ldots, X_{200}$, gave the following
sample statistics:

sample mean: $\bar x_{200} = 3.82$;
sample variance: $\hat\gamma(0) = 1.15$;
sample ACF: $\hat\rho(1) = .427$; $\hat\rho(2) = .475$; $\hat\rho(3) = .169$.

a. Based on these sample statistics, is it reasonable to suppose that $\{X_t - \mu\}$ is
white noise?

b. Assuming that $\{X_t - \mu\}$ can be modeled as the AR(2) process

$$X_t - \mu - \phi_1(X_{t-1} - \mu) - \phi_2(X_{t-2} - \mu) = Z_t,$$

where $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$, find estimates of $\mu$, $\phi_1$, $\phi_2$, and $\sigma^2$.

c. Would you conclude that $\mu = 0$?

d. Construct 95% confidence intervals for $\phi_1$ and $\phi_2$.


e. Assuming that the data were generated from an AR(2) model, derive estimates
of the PACF for all lags $h \ge 1$.


5.5. Use the program ITSM to simulate and file 20 realizations of length 200 of the
Gaussian MA(1) process

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 1),$$

with $\theta = 0.6$.

a. For each series find the moment estimate of $\theta$ as defined in Example 5.1.2.

b. For each series use the innovations algorithm in the ITSM option
Model>Estimation>Preliminary to find an estimate of $\theta$. (Use the default value
of the parameter m.) As soon as you have found this preliminary estimate
for a particular series, select Model>Estimation>Max likelihood to find
the maximum likelihood estimate of $\theta$ for the series.

c. Compute the sample means and sample variances of your three sets of estimates.

d. Use the asymptotic formulae given at the end of Section 5.1.1 (with $n = 200$)
to compute the variances of the moment, innovation, and maximum
likelihood estimators of $\theta$. Compare with the corresponding sample variances
found in (c).

e. What do the results of (c) suggest concerning the relative merits of the three
estimators?


5.6. Establish the recursions (5.1.19) and (5.1.20) for the forward and backward
prediction errors $u_i(t)$ and $v_i(t)$ in Burg's algorithm.


5.7. Derive the recursions for the Burg estimates $\phi_{ii}^{(B)}$ and $\sigma_i^{(B)2}$.


5.8. From the innovation form of the likelihood (5.2.9) derive the equations (5.2.10),
(5.2.11), and (5.2.12) for the maximum likelihood estimators of the parameters
of an ARMA process.


5.9. Use equation (5.2.9) to show that for $n > p$, the likelihood of the observations
$\{X_1, \ldots, X_n\}$ of the causal AR($p$) process defined by

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

is

$$L(\boldsymbol\phi, \sigma^2) = (2\pi\sigma^2)^{-n/2} (\det G_p)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}\left[\mathbf{X}_p' G_p^{-1} \mathbf{X}_p + \sum_{t=p+1}^{n} \left(X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p}\right)^2\right]\right\},$$

where $\mathbf{X}_p = (X_1, \ldots, X_p)'$ and $G_p = \sigma^{-2}\Gamma_p = \sigma^{-2}E(\mathbf{X}_p\mathbf{X}_p')$.


5.10. Use the result of Problem 5.9 to derive a pair of linear equations for the least
squares estimates of $\phi_1$ and $\phi_2$ for a causal AR(2) process (with mean zero).
Compare your equations with those for the Yule-Walker estimates. (Assume
that the mean is known to be zero in writing down the latter equations, so that
the sample autocovariances are $\hat\gamma(h) = \frac{1}{n}\sum_{t=1}^{n-h} X_{t+h}X_t$ for $h \ge 0$.)


5.11. Given two observations $x_1$ and $x_2$ from the causal AR(1) process satisfying

$$X_t = \phi X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

and assuming that $|x_1| \ne |x_2|$, find the maximum likelihood estimates of $\phi$
and $\sigma^2$.


5.12. Derive a cubic equation for the maximum likelihood estimate of the coefficient
$\phi$ of a causal AR(1) process based on the observations $X_1, \ldots, X_n$.


5.13. Use the result of Problem A.7 and the approximate large-sample normal distribution
of the maximum likelihood estimator $\hat{\boldsymbol\phi}_p$ to establish the approximation
(5.5.1).


Nonstationary and Seasonal Time Series Models


6.1 ARIMA Models for Nonstationary Time Series 
6.2 Identification Techniques 

6.3 Unit Roots in Time Series Models 

6.4 Forecasting ARIMA Models 

6.5 Seasonal ARIMA Models 

6.6 Regression with ARMA Errors 


In this chapter we shall examine the problem of finding an appropriate model for a 
given set of observations $\{x_1, \ldots, x_n\}$ that are not necessarily generated by a stationary
time series. If the data (a) exhibit no apparent deviations from stationarity and (b) 
have a rapidly decreasing autocovariance function, we attempt to fit an ARMA model 
to the mean-corrected data using the techniques developed in Chapter 5. Otherwise, 
we look first for a transformation of the data that generates a new series with the 
properties (a) and (b). This can frequently be achieved by differencing, leading us 
to consider the class of ARIMA (autoregressive integrated moving-average) models, 
defined in Section 6.1. We have in fact already encountered ARIMA processes. The 
model fitted in Example 5.1.1 to the Dow Jones Utilities Index was obtained by fitting 
an AR model to the differenced data, thereby effectively fitting an ARIMA model to 
the original series. In Section 6.1 we shall give a more systematic account of such 
models. 

In Section 6.2 we discuss the problem of finding an appropriate transformation for 
the data and identifying a satisfactory ARMA(p, q) model for the transformed data. 
The latter can be handled using the techniques developed in Chapter 5. The sample 
ACF and PACF and the preliminary estimators $\hat{\boldsymbol\phi}_m$ and $\hat{\boldsymbol\theta}_m$ of Section 5.1 can provide
useful guidance in this choice. However, our prime criterion for model selection will
be the AICC statistic discussed in Section 5.5.2. To apply this criterion we compute
maximum likelihood estimators of $\boldsymbol\phi$, $\boldsymbol\theta$, and $\sigma^2$ for a variety of competing $p$ and $q$
values and choose the fitted model with smallest AICC value. Other techniques, in
particular those that use the R and S arrays of Gray et al. (1978), are discussed in 
the survey of model identification by de Gooijer et al. (1985). If the fitted model is 
satisfactory, the residuals (see Section 5.3) should resemble white noise. Tests for this 
were described in Section 5.3 and should be applied to the minimum AICC model 
to make sure that the residuals are consistent with their expected behavior under 
the model. If they are not, then competing models (models with AICC value close 
to the minimum) should be checked until we find one that passes the goodness of 
fit tests. In some cases a small difference in AICC value (say less than 2) between 
two satisfactory models may be ignored in the interest of model simplicity. In Sec- 
tion 6.3 we consider the problem of testing for a unit root of either the autoregressive 
or moving-average polynomial. An autoregressive unit root suggests that the data 
require differencing, and a moving-average unit root suggests that they have been 
overdifferenced. Section 6.4 considers the prediction of ARIMA processes, which 
can be carried out using an extension of the techniques developed for ARMA pro- 
cesses in Sections 3.3 and 5.4. In Section 6.5 we examine the fitting and prediction 
of seasonal ARIMA (SARIMA) models, whose analysis, except for certain aspects 
of model identification, is quite analogous to that of ARIMA processes. Finally, we 
consider the problem of regression, allowing for dependence between successive 
residuals from the regression. Such models are known as regression models with 
time series residuals and often occur in practice as natural representations for data 
containing both trend and serially dependent errors. 


6.1 ARIMA Models for Nonstationary Time Series 


We have already discussed the importance of the class of ARMA models for representing
stationary series. A generalization of this class, which incorporates a wide
range of nonstationary series, is provided by the ARIMA processes, i.e., processes
that reduce to ARMA processes when differenced finitely many times.


Definition 6.1.1

If $d$ is a nonnegative integer, then $\{X_t\}$ is an ARIMA($p$, $d$, $q$) process if
$Y_t := (1 - B)^d X_t$ is a causal ARMA($p$, $q$) process.


This definition means that $\{X_t\}$ satisfies a difference equation of the form

$$\phi^*(B)X_t \equiv \phi(B)(1 - B)^d X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.1.1}$$

where $\phi(z)$ and $\theta(z)$ are polynomials of degrees $p$ and $q$, respectively, and $\phi(z) \ne 0$
for $|z| \le 1$. The polynomial $\phi^*(z)$ has a zero of order $d$ at $z = 1$. The process $\{X_t\}$ is
stationary if and only if $d = 0$, in which case it reduces to an ARMA($p$, $q$) process.

Notice that if $d \ge 1$, we can add an arbitrary polynomial trend of degree $(d - 1)$ to
$\{X_t\}$ without violating the difference equation (6.1.1). ARIMA models are therefore
useful for representing data with trend (see Sections 1.5 and 6.2). It should be noted,
however, that ARIMA processes can also be appropriate for modeling series with no
trend. Except when $d = 0$, the mean of $\{X_t\}$ is not determined by equation (6.1.1),
and it can in particular be zero (as in Example 1.3.3). Since for $d \ge 1$, equation
(6.1.1) determines the second-order properties of $\{(1 - B)^d X_t\}$ but not those of $\{X_t\}$
(Problem 6.1), estimation of $\boldsymbol\phi$, $\boldsymbol\theta$, and $\sigma^2$ will be based on the observed differences
$(1 - B)^d X_t$. Additional assumptions are needed for prediction (see Section 6.4).


Example 6.1.1

$\{X_t\}$ is an ARIMA(1,1,0) process if for some $\phi \in (-1, 1)$,

$$(1 - \phi B)(1 - B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

We can then write

$$X_t = X_0 + \sum_{j=1}^{t} Y_j, \quad t \ge 1,$$

where

$$Y_t = (1 - B)X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}.$$

A realization of $\{X_1, \ldots, X_{200}\}$ with $X_0 = 0$, $\phi = 0.8$, and $\sigma^2 = 1$ is shown in
Figure 6.1, with the corresponding sample autocorrelation and partial autocorrelation
functions in Figures 6.2 and 6.3, respectively.

Figure 6-1 200 observations of the ARIMA(1,1,0) series $X_t$ of Example 6.1.1.
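
The construction in this example can be sketched directly: simulate the AR(1) series $\{Y_t\}$, cumulate it to obtain $\{X_t\}$, and confirm that differencing recovers $\{Y_t\}$. The code below is an illustration only; the function names, the burn-in length, and the seed are arbitrary choices, and the realization it produces is not the one shown in Figure 6.1.

```python
import random


def simulate_ar1(phi, n, sigma=1.0, burn_in=500, seed=0):
    """Simulate n values of Y_t = phi * Y_{t-1} + Z_t with Gaussian white noise."""
    rng = random.Random(seed)
    y = 0.0
    values = []
    for t in range(burn_in + n):
        y = phi * y + rng.gauss(0.0, sigma)
        if t >= burn_in:  # discard the burn-in so the start-up value has little effect
            values.append(y)
    return values


def integrate(y, x0=0.0):
    """Return [X_0, X_1, ..., X_n] with X_t = X_0 + Y_1 + ... + Y_t."""
    x = [x0]
    for value in y:
        x.append(x[-1] + value)
    return x


def difference(x):
    """Apply the operator 1 - B: returns X_t - X_{t-1} for t = 1, ..., n."""
    return [current - previous for previous, current in zip(x, x[1:])]


y = simulate_ar1(phi=0.8, n=200)
x = integrate(y, x0=0.0)
recovered = difference(x)
print(len(x), max(abs(a - b) for a, b in zip(y, recovered)))
```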


A distinctive feature of the data that suggests the appropriateness of an ARIMA
model is the slowly decaying positive sample autocorrelation function in Figure 6.2.
If, therefore, we were given only the data and wished to find an appropriate model, it
would be natural to apply the operator $\nabla = 1 - B$ repeatedly in the hope that for some
$j$, $\{\nabla^j X_t\}$ will have a rapidly decaying sample autocorrelation function compatible
with that of an ARMA process with no zeros of the autoregressive polynomial near the
unit circle. For this particular time series, one application of the operator $\nabla$ produces
the realization shown in Figure 6.4, whose sample ACF and PACF (Figures 6.5 and
6.6) suggest an AR(1) (or possibly AR(2)) model for $\{\nabla X_t\}$. The maximum likelihood
estimates of $\phi$ and $\sigma^2$ obtained from ITSM under the assumption that $E(\nabla X_t) = 0$
(found by not subtracting the mean after differencing the data) are .808 and .978,
respectively, giving the model

$$(1 - 0.808B)(1 - B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.978), \tag{6.1.2}$$

which bears a close resemblance to the true underlying process,

$$(1 - 0.8B)(1 - B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1). \tag{6.1.3}$$

Figure 6-2 The sample ACF of the data in Figure 6.1.

Figure 6-3 The sample PACF of the data in Figure 6.1.

Figure 6-4 199 observations of the series $Y_t = \nabla X_t$ with $\{X_t\}$ as in Figure 6.1.


Instead of differencing the series in Figure 6.1 we could proceed more directly by 
attempting to fit an AR(2) process as suggested by the sample PACF of the original 
series in Figure 6.3. Maximum likelihood estimation, carried out using ITSM after 
fitting a preliminary model with Burg's algorithm and assuming that $EX_t = 0$, gives
the model


$$(1 - 1.808B + 0.811B^2)X_t = (1 - 0.825B)(1 - .983B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.970), \tag{6.1.4}$$


which, although stationary, has coefficients closely resembling those of the true non- 
stationary process (6.1.3). (To obtain the model (6.1.4), two optimizations were 
carried out using the Model>Estimation>Max likelihood option of ITSM, the 
first with the default settings and the second after setting the accuracy parameter to 
0.00001.) 

From a sample of finite length it will be extremely difficult to distinguish between
a nonstationary process such as (6.1.3), for which $\phi^*(1) = 0$, and a process such as
(6.1.4), which has very similar coefficients but for which $\phi^*$ has all of its zeros outside
the unit circle. In either case, however, if it is possible by differencing to generate a
series with rapidly decaying sample ACF, then the differenced data set can be fitted
by a low-order ARMA process whose autoregressive polynomial $\phi^*$ has zeros that
are comfortably outside the unit circle. This means that the fitted parameters will
be well away from the boundary of the allowable parameter set. This is desirable
for numerical computation of parameter estimates and can be quite critical for some
methods of estimation. For example, if we apply the Yule-Walker equations to fit an
AR(2) model to the data in Figure 6.1, we obtain the model

$$(1 - 1.282B + 0.290B^2)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 6.435), \tag{6.1.5}$$

which bears little resemblance to either the maximum likelihood model (6.1.4) or the
true model (6.1.3). In this case the matrix $\hat R_2$ appearing in (5.1.7) is nearly singular.

Figure 6-5 The sample ACF of the series $\{Y_t\}$ in Figure 6.4.

Figure 6-6 The sample PACF of the series $\{Y_t\}$ in Figure 6.4.

Figure 6-7 200 observations of the AR(2) process defined by (6.1.6) with $r = 1.005$ and $\omega = \pi/3$.

An obvious limitation in fitting an ARIMA(), d, q) process {X,} to data is that 
{X,} is permitted to be nonstationary only in a very special way, i.e., by allowing the 
polynomial ¢*(B) in the representation ¢*(B)X, = Z, to have a zero of multiplicity 
d at the point 1 on the unit circle. Such models are appropriate when the sample ACF 
is a Slowly decaying positive function as in Figure 6.2, since sample autocorrelation 
functions of this form are associated with models ¢*(B)X, = 6(B)Z, in which ¢* 
has a zero either at or close to 1. 

Sample autocorrelations with slowly decaying oscillatory behavior as in Figure 
6.8 are associated with models ¢*(B)X, = 9(B)Z, in which ¢* has a zero close to e!” 
for some w € (—Z, 7] other than 0. Figure 6.8 is the sample ACF of the series of 200 
observations in Figure 6.7, obtained from ITSM by simulating the AR(2) process 


X, — (2r! cosw)X;-1 +r? X;-2 = Z, {Z} ~ WNO, 1), (6.1.6) 
with r = 1.005 and w = 7/3, i.e., 
X, — 0.9950X,_; + 0.9901 X,2 = Z, {Z} ~ WN(O, 1). 
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Figure 6-8 
The sample ACF of the 
data in Figure 6.7. 
The autocorrelation function of the model (6.1.6) can be derived by noting that

$$1 - (2r^{-1}\cos\omega)B + r^{-2}B^2 = (1 - r^{-1}e^{i\omega}B)(1 - r^{-1}e^{-i\omega}B) \tag{6.1.7}$$

and using (3.2.12). This gives

$$\rho(h) = r^{-h}\,\frac{\sin(h\omega + \psi)}{\sin\psi}, \quad h \ge 0, \tag{6.1.8}$$

where

$$\tan\psi = \frac{r^2 + 1}{r^2 - 1}\tan\omega. \tag{6.1.9}$$

It is clear from these equations that

$$\rho(h) \to \cos(h\omega) \quad \text{as } r \downarrow 1. \tag{6.1.10}$$


With $r = 1.005$ and $\omega = \pi/3$ as in the model generating Figure 6.7, the model
ACF (6.1.8) is a damped sine wave with damping ratio 1/1.005 and period 6. These
properties are reflected in the sample ACF shown in Figure 6.8. For values of $r$ closer
to 1, the damping will be even slower as the model ACF approaches its limiting form
(6.1.10).
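The closed form (6.1.8)-(6.1.9) can be checked numerically against the usual AR(2) recursion $\rho(h) = \phi_1\rho(h-1) + \phi_2\rho(h-2)$ with $\rho(1) = \phi_1/(1 - \phi_2)$. The following is only a sketch; the function names are ours and are not part of ITSM.

```python
import math


def ar2_acf_closed_form(r, omega, max_lag):
    """Model ACF from (6.1.8), with psi obtained from (6.1.9)."""
    psi = math.atan2((r * r + 1.0) * math.tan(omega), r * r - 1.0)
    return [r ** (-h) * math.sin(h * omega + psi) / math.sin(psi)
            for h in range(max_lag + 1)]


def ar2_acf_recursive(r, omega, max_lag):
    """Model ACF of (6.1.6) from the AR(2) difference equations."""
    phi1 = 2.0 * math.cos(omega) / r   # coefficient of X_{t-1}
    phi2 = -1.0 / (r * r)              # coefficient of X_{t-2}
    rho = [1.0, phi1 / (1.0 - phi2)]
    for h in range(2, max_lag + 1):
        rho.append(phi1 * rho[h - 1] + phi2 * rho[h - 2])
    return rho[:max_lag + 1]


closed = ar2_acf_closed_form(1.005, math.pi / 3, 40)
recursive = ar2_acf_recursive(1.005, math.pi / 3, 40)
max_gap = max(abs(a - b) for a, b in zip(closed, recursive))
```

With $r = 1.005$ and $\omega = \pi/3$ the two computations should agree to rounding error, and the value at lag 6 (one full period) is $r^{-6}$.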

If we were simply given the data shown in Figure 6.7, with no indication of
the model from which it was generated, the slowly damped sinusoidal sample ACF
with period 6 would suggest trying to make the sample ACF decay more rapidly
by applying the operator (6.1.7) with $r = 1$ and $\omega = \pi/3$, i.e., $(1 - B + B^2)$. If it
happens, as in this case, that the period $2\pi/\omega$ is close to some integer $s$ (in this case
6), then the operator $1 - B^s$ can also be applied to produce a series with more rapidly
Figure 6-9
The sample ACF of $(1 - B + B^2)X_t$ with $\{X_t\}$ as in Figure 6.7.


decaying autocorrelation function (see also Section 6.5). Figures 6.9 and 6.10 show
the sample autocorrelation functions obtained after applying the operators $1 - B + B^2$
and $1 - B^6$, respectively, to the data shown in Figure 6.7. For either one of these two
differenced series, it is then not difficult to fit an ARMA model $\phi(B)X_t = \theta(B)Z_t$
for which the zeros of $\phi$ are well outside the unit circle. Techniques for identifying
and determining such ARMA models have already been introduced in Chapter 5. For
convenience we shall collect these together in the following sections with a number
of illustrative examples.
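To illustrate why these two operators are natural choices, the sketch below applies a polynomial in $B$ to a series and checks that both $1 - B + B^2$ and $1 - B^6$ annihilate a pure cosine of period 6. The helper `apply_operator` is a hypothetical name introduced here for illustration only.

```python
import math


def apply_operator(x, coeffs):
    """Apply sum_k coeffs[k] B^k to x, returning values for t >= len(coeffs) - 1."""
    lag = len(coeffs) - 1
    return [sum(c * x[t - k] for k, c in enumerate(coeffs))
            for t in range(lag, len(x))]


# A noise-free cosine with period 6, i.e. omega = pi/3.
wave = [math.cos(math.pi * t / 3) for t in range(30)]

after_quadratic = apply_operator(wave, [1, -1, 1])            # 1 - B + B^2
after_lag6 = apply_operator(wave, [1, 0, 0, 0, 0, 0, -1])     # 1 - B^6
```

Both filtered series are zero up to rounding error, since $e^{\pm i\pi/3}$ are zeros of $1 - z + z^2$ and of $1 - z^6$. For real data the oscillation is only approximately removed, which is why the resulting sample ACF decays more rapidly rather than vanishing.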


6.2 Identification Techniques


(a) Preliminary Transformations. The estimation methods of Chapter 5 enable us to 
find, for given values of p and q, an ARMA(p, q) model to fit a given series of data. 
For this procedure to be meaningful it must be at least plausible that the data are in 
fact a realization of an ARMA process and in particular a realization of a stationary 
process. If the data display characteristics suggesting nonstationarity (e.g., trend and 
seasonality), then it may be necessary to make a transformation so as to produce a 
new series that is more compatible with the assumption of stationarity.

Deviations from stationarity may be suggested by the graph of the series itself or 
by the sample autocorrelation function or both. 

Inspection of the graph of the series will occasionally reveal a strong depen- 
dence of variability on the level of the series, in which case the data should first be 
transformed to reduce or eliminate this dependence. For example, Figure 1.1 shows 
Figure 6-10
The sample ACF of $(1 - B^6)X_t$ with $\{X_t\}$ as in Figure 6.7.
the Australian monthly red wine sales from January 1980 through October 1991,
and Figure 1.17 shows how the increasing variability with sales level is reduced
by taking natural logarithms of the original series. The logarithmic transformation
$V_t = \ln U_t$ used here is in fact appropriate whenever $\{U_t\}$ is a series whose standard
deviation increases linearly with the mean. For a systematic account of a general class
of variance-stabilizing transformations, we refer the reader to Box and Cox (1964).
The defining equation for the general Box-Cox transformation $f_\lambda$ is

$$f_\lambda(U_t) = \begin{cases} \lambda^{-1}(U_t^{\lambda} - 1), & U_t \ge 0,\ \lambda > 0, \\ \ln U_t, & U_t > 0,\ \lambda = 0, \end{cases}$$


and the program ITSM provides the option (Transform>Box-Cox) of applying $f_\lambda$
(with $0 \le \lambda \le 1.5$) prior to the elimination of trend and/or seasonality from the data.
In practice, if a Box-Cox transformation is necessary, it is often the case that either
$f_0$ or $f_{0.5}$ is adequate.
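For readers working outside ITSM, the transformation $f_\lambda$ can be sketched in a few lines. The function name `box_cox` is ours; it simply mirrors the two cases of the defining equation above.

```python
import math


def box_cox(u, lam):
    """Box-Cox transform f_lambda(u) for lam >= 0, as defined in the text."""
    if lam < 0:
        raise ValueError("this sketch only covers lam >= 0")
    if lam == 0:
        if u <= 0:
            raise ValueError("the logarithmic case requires u > 0")
        return math.log(u)
    if u < 0:
        raise ValueError("the power case requires u >= 0")
    return (u ** lam - 1.0) / lam


log_case = box_cox(math.e, 0)       # f_0 is the natural logarithm
sqrt_case = box_cox(4.0, 0.5)       # f_0.5(4) = (2 - 1) / 0.5
identity_case = box_cox(3.0, 1.0)   # f_1(u) = u - 1
```

Applying `box_cox` to each observation with `lam` equal to 0 or 0.5 gives the two transformations that are most often adequate in practice.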

Trend and seasonality are usually detected by inspecting the graph of the (possibly 
transformed) series. However, they are also characterized by autocorrelation functions 
that are slowly decaying and nearly periodic, respectively. The elimination of trend 
and seasonality was discussed in Section 1.5, where we described two methods: 


i. “classical decomposition” of the series into a trend component, a seasonal com- 
ponent, and a random residual component, and 
ii. differencing. 
Figure 6-11
The Australian red wine data after taking natural logarithms and removing a seasonal component of period 12 and a linear trend.
The program ITSM (in the Transform option) offers a choice between these techniques.
The results of applying methods (i) and (ii) to the transformed red wine data
$V_t = \ln U_t$ in Figure 1.17 are shown in Figures 6.11 and 6.12, respectively. Figure
6.11 was obtained from ITSM by estimating and removing from $\{V_t\}$ a linear trend
component and a seasonal component with period 12. Figure 6.12 was obtained by
applying the operator $(1 - B^{12})$ to $\{V_t\}$. Neither of the two resulting series displays
any apparent deviations from stationarity, nor do their sample autocorrelation functions.
The sample ACF and PACF of $\{(1 - B^{12})V_t\}$ are shown in Figures 6.13 and
6.14, respectively.

After the elimination of trend and seasonality, it is still possible that the sample 
autocorrelation function may appear to be that of a nonstationary (or nearly nonsta- 
tionary) process, in which case further differencing may be carried out. 

In Section 1.5 we also mentioned a third possible approach: 


iii. fitting a sum of harmonics and a polynomial trend to generate a noise sequence 
that consists of the residuals from the regression. 


In Section 6.6 we discuss the modifications to classical least squares regression analy- 
sis that allow for dependence among the residuals from the regression. These modifica- 
tions are implemented in the ITSM option Regression>Estimation>Generalized 
LS. 

(b) Identification and Estimation. Let $\{X_t\}$ be the mean-corrected transformed
series found as described in (a). The problem now is to find the most satisfactory
ARMA$(p, q)$ model to represent $\{X_t\}$. If $p$ and $q$ were known in advance, this would
be a straightforward application of the estimation techniques described in Chapter 5.
Figure 6-12

The Australian red 
wine data after taking 
natural logarithms and 
differencing at lag 12. 


Figure 6-13 
The sample ACF of the 
data in Figure 6.12. 
However, this is usually not the case, so it becomes necessary also to identify appropriate
values for p and q.

It might appear at first sight that the higher the values chosen for p and q, the 
better the resulting fitted model will be. However, as pointed out in Section 5.5, 
estimation of too large a number of parameters introduces estimation errors that 
adversely affect the use of the fitted model for prediction as illustrated in Section 5.4. 
We therefore minimize one of the model selection criteria discussed in Section 5.5 in 
order to choose the values of $p$ and $q$. Each of these criteria includes a penalty term
to discourage the fitting of too many parameters. We shall base our choice of $p$ and
$q$ primarily on the minimization of the AICC statistic, defined as

$$\mathrm{AICC}(\phi, \theta) = -2\ln L(\phi, \theta, S(\phi, \theta)/n) + 2(p + q + 1)n/(n - p - q - 2), \tag{6.2.1}$$

where $L(\phi, \theta, \sigma^2)$ is the likelihood of the data under the Gaussian ARMA model with
parameters $(\phi, \theta, \sigma^2)$, and $S(\phi, \theta)$ is the residual sum of squares defined in (5.2.11).
Once a model has been found that minimizes the AICC value, it is then necessary 
to check the model for goodness of fit (essentially by checking that the residuals are 
like white noise) as discussed in Section 5.3. 
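The penalty term in (6.2.1) is easy to evaluate once the maximized value of $-2\ln L$ is available from some fitting routine. The sketch below assumes that value is supplied by the caller; `aicc` and `select_min_aicc` are illustrative names, and the likelihood values in the usage lines are made up purely to show the mechanics.

```python
def aicc(neg2_log_likelihood, p, q, n):
    """AICC of (6.2.1), given -2 ln L evaluated at the fitted parameters."""
    if n - p - q - 2 <= 0:
        raise ValueError("AICC requires n > p + q + 2")
    return neg2_log_likelihood + 2.0 * (p + q + 1) * n / (n - p - q - 2)


def select_min_aicc(candidates, n):
    """Return the (p, q) pair with the smallest AICC.

    candidates maps (p, q) to the corresponding value of -2 ln L.
    """
    return min(candidates, key=lambda pq: aicc(candidates[pq], pq[0], pq[1], n))


# Hypothetical -2 ln L values for two candidate orders, with n = 50.
hypothetical_fits = {(1, 0): 100.0, (5, 5): 99.0}
best_order = select_min_aicc(hypothetical_fits, 50)
```

In this made-up comparison the larger model has a slightly higher likelihood, but its penalty is much larger, so the AR(1) candidate is selected.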

For any fixed values of $p$ and $q$, the maximum likelihood estimates of $\phi$ and $\theta$
are the values that minimize the AICC. Hence, the minimum AICC model (over any
given range of $p$ and $q$ values) can be found by computing the maximum likelihood
estimators for each fixed $p$ and $q$ and choosing from these the maximum likelihood
model with the smallest value of AICC. This can be done with the program ITSM
by using the option Model>Estimation>Autofit. When this option is selected and 
upper and lower bounds for p and q are specified, the program fits maximum like- 
lihood models for each pair (p, q) in the range specified and selects the model with 
smallest AICC value. If some of the coefficient estimates are small compared with 
their estimated standard deviations, maximum likelihood subset models (with those 
coefficients set to zero) can also be explored. 

The steps in model identification and estimation can be summarized as follows: 


Example 6.2.1 


• After transforming the data (if necessary) to make the fitting of an ARMA$(p, q)$
model reasonable, examine the sample ACF and PACF to get some idea of potential
$p$ and $q$ values. Preliminary estimation using the ITSM option
Model>Estimation>Preliminary is also useful in this respect. Burg’s algorithm with AICC
minimization rapidly fits autoregressions of all orders up to 27 and selects the one
with minimum AICC value. For preliminary estimation of models with $q > 0$,
each pair $(p, q)$ must be considered separately.


• Select the option Model>Estimation>Autofit of ITSM. Specify the required
limits for $p$ and $q$, and the program will then use maximum likelihood estimation
to find the minimum AICC model with $p$ and $q$ in the range specified.


• Examination of the fitted coefficients and their standard errors may suggest that
some of them can be set to zero. If this is the case, then a subset model can
be fitted by clicking on the button Constrain optimization in the Maximum
Likelihood Estimation dialog box and setting the selected coefficients to
zero. Optimization will then give the maximum likelihood model with the chosen
coefficients constrained to be zero. The constrained model is assessed by
comparing its AICC value with those of the other candidate models.


• Check the candidate model(s) for goodness of fit as described in Section 5.3.
These tests can be performed by selecting the option Statistics>Residual
Analysis.


The Australian red wine data 


Let $\{X_1, \ldots, X_{130}\}$ denote the series obtained from the red wine data of Example 1.1.1
after taking natural logarithms, differencing at lag 12, and subtracting the mean
(0.0681) of the differences. The data prior to mean correction are shown in Figure 6.12.
The sample PACF of $\{X_t\}$, shown in Figure 6.14, suggests that an AR(12)
model might be appropriate for this series. To explore this possibility we use the
ITSM option Model>Estimation>Preliminary with Burg’s algorithm and AICC
minimization. As anticipated, the fitted Burg models do indeed have minimum AICC
when $p = 12$. The fitted model is

$$\begin{aligned}(1 &- .245B - .069B^2 - .012B^3 - .021B^4 - .200B^5 + .025B^6 + .004B^7 \\ &- .133B^8 + .010B^9 - .095B^{10} + .118B^{11} + .384B^{12})X_t = Z_t,\end{aligned}$$


with $\{Z_t\} \sim \mathrm{WN}(0, 0.0135)$ and AICC value $-158.77$. Selecting the option
Model>Estimation>Max likelihood then gives the maximum likelihood AR(12) model,
which is very similar to the Burg model and has AICC value $-158.87$. Inspection of the
standard errors of the coefficient estimators suggests the possibility of setting those at
lags 2, 3, 4, 6, 7, 9, 10, and 11 equal to zero. If we do this by clicking on the Constrain
optimization button in the Maximum Likelihood Estimation dialog box and


Example 6.2.2 


then reoptimize, we obtain the model,

$$(1 - .270B - .224B^5 - .149B^8 + .099B^{11} + .353B^{12})X_t = Z_t,$$

with $\{Z_t\} \sim \mathrm{WN}(0, 0.0138)$ and AICC value $-172.49$.

In order to check more general ARMA(p, q) models, select the option Model> 
Estimation>Autofit and specify the minimum and maximum values of p and 
q to be zero and 15, respectively. (The sample ACF and PACF suggest that these 
limits should be more than adequate to include the minimum AICC model.) In a 
few minutes (depending on the speed of your computer) the program selects an 
ARMA(1,12) model with AICC value —172.74, which is slightly better than the 
subset AR(12) model just found. Inspection of the estimated standard deviations of 
the MA coefficients at lags 1, 3, 4, 6, 7, 9, and 11 suggests setting them equal to zero 
and reestimating the values of the remaining coefficients. If we do this by clicking on 
the Constrain optimization button in the Maximum Likelihood Estimation 
dialog box, setting the required coefficients to zero and then reoptimizing, we obtain 
the model, 


$$(1 - .286B)X_t = (1 + .127B^2 + .183B^5 + .177B^8 + .181B^{10} - .554B^{12})Z_t,$$

with $\{Z_t\} \sim \mathrm{WN}(0, 0.0120)$ and AICC value $-184.09$.

The subset ARMA(1,12) model easily passes all the goodness of fit tests in the 
Statistics>Residual Analysis option. In view of this and its small AICC value, 
we accept it as a plausible model for the transformed red wine series. 


The lake data 


Let $\{Y_t, t = 1, \ldots, 99\}$ denote the lake data of Example 1.3.5. We have seen already
in Example 5.2.5 that the ITSM option Model>Estimation>Autofit gives
the minimum-AICC model

$$X_t - 0.7446X_{t-1} = Z_t + 0.3213Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4750),$$

for the mean-corrected series $X_t = Y_t - 9.0041$. The corresponding AICC value is
212.77. Since the model passes all the goodness of fit tests, we accept it as a reasonable
model for the data.


6.3 Unit Roots in Time Series Models


The unit root problem in time series arises when either the autoregressive or moving- 
average polynomial of an ARMA model has a root on or near the unit circle. A 
unit root in either of these polynomials has important implications for modeling. 
For example, a root near 1 of the autoregressive polynomial suggests that the data 
should be differenced before fitting an ARMA model, whereas a root near 1 of 
the moving-average polynomial indicates that the data were overdifferenced. In this
section, we consider inference procedures for detecting the presence of a unit root in
the autoregressive and moving-average polynomials.


6.3.1 Unit Roots in Autoregressions 


In Section 6.1 we discussed the use of differencing to transform a nonstationary time
series with a slowly decaying sample ACF and values near 1 at small lags into one with
a rapidly decreasing sample ACF. The degree of differencing of a time series $\{X_t\}$ was
largely determined by applying the difference operator repeatedly until the sample
ACF of $\{\nabla^d X_t\}$ decays quickly. The differenced time series could then be modeled by
a low-order ARMA$(p, q)$ process, and hence the resulting ARIMA$(p, d, q)$ model
for the original data has an autoregressive polynomial $(1 - \phi_1 z - \cdots - \phi_p z^p)(1 - z)^d$
(see (6.1.1)) with $d$ roots on the unit circle. In this subsection we discuss a more
systematic approach to testing for the presence of a unit root of the autoregressive
polynomial in order to decide whether or not a time series should be differenced. This
approach was pioneered by Dickey and Fuller (1979).
Let $X_1, \ldots, X_n$ be observations from the AR(1) model

$$X_t - \mu = \phi_1(X_{t-1} - \mu) + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.3.1}$$


where $|\phi_1| < 1$ and $\mu = EX_t$. For large $n$, the maximum likelihood estimator $\hat\phi_1$ of $\phi_1$
is approximately $N(\phi_1, (1 - \phi_1^2)/n)$. For the unit root case, this normal approximation
is no longer applicable, even asymptotically, which precludes its use for testing the
unit root hypothesis $H_0: \phi_1 = 1$ vs. $H_1: \phi_1 < 1$. To construct a test of $H_0$, write the
model (6.3.1) as

$$\nabla X_t = X_t - X_{t-1} = \phi_0^* + \phi_1^* X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.3.2}$$

where $\phi_0^* = \mu(1 - \phi_1)$ and $\phi_1^* = \phi_1 - 1$. Now let $\hat\phi_1^*$ be the ordinary least squares
(OLS) estimator of $\phi_1^*$ found by regressing $\nabla X_t$ on 1 and $X_{t-1}$. The estimated standard
error of $\hat\phi_1^*$ is

$$\widehat{\mathrm{SE}}(\hat\phi_1^*) = S\left(\sum_{t=2}^{n}(X_{t-1} - \bar X)^2\right)^{-1/2},$$

where $S^2 = \sum_{t=2}^{n}(\nabla X_t - \hat\phi_0^* - \hat\phi_1^* X_{t-1})^2/(n - 3)$ and $\bar X$ is the sample mean of
$X_1, \ldots, X_{n-1}$. Dickey and Fuller derived the limit distribution as $n \to \infty$ of the
$t$-ratio

$$\hat\tau_\mu := \hat\phi_1^*/\widehat{\mathrm{SE}}(\hat\phi_1^*) \tag{6.3.3}$$

under the unit root assumption $\phi_1^* = 0$, from which a test of the null hypothesis
$H_0: \phi_1 = 1$ can be constructed. The .01, .05, and .10 quantiles of the limit distribution
of $\hat\tau_\mu$ (see Table 8.5.2 of Fuller, 1976) are $-3.43$, $-2.86$, and $-2.57$, respectively.
The augmented Dickey-Fuller test then rejects the null hypothesis of a unit root,
at say, level .05 if $\hat\tau_\mu < -2.86$. Notice that the cutoff value for this test statistic is
much smaller than the standard cutoff value of $-1.645$ obtained from the normal
approximation to the $t$-distribution, so that the unit root hypothesis is less likely to
be rejected using the correct limit distribution.

The above procedure can be extended to the case where $\{X_t\}$ follows the AR$(p)$
model with mean $\mu$ given by

$$X_t - \mu = \phi_1(X_{t-1} - \mu) + \cdots + \phi_p(X_{t-p} - \mu) + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

This model can be rewritten as (see Problem 6.2)

$$\nabla X_t = \phi_0^* + \phi_1^* X_{t-1} + \phi_2^* \nabla X_{t-1} + \cdots + \phi_p^* \nabla X_{t-p+1} + Z_t, \tag{6.3.4}$$

where $\phi_0^* = \mu(1 - \phi_1 - \cdots - \phi_p)$, $\phi_1^* = \sum_{i=1}^{p}\phi_i - 1$, and $\phi_j^* = -\sum_{i=j}^{p}\phi_i$, $j = 2, \ldots, p$.
If the autoregressive polynomial has a unit root at 1, then $0 = \phi(1) = -\phi_1^*$,
and the differenced series $\{\nabla X_t\}$ is an AR$(p - 1)$ process. Consequently, testing the
hypothesis of a unit root at 1 of the autoregressive polynomial is equivalent to testing
$\phi_1^* = 0$. As in the AR(1) example, $\phi_1^*$ can be estimated as the coefficient of $X_{t-1}$ in
the OLS regression of $\nabla X_t$ onto $1, X_{t-1}, \nabla X_{t-1}, \ldots, \nabla X_{t-p+1}$. For large $n$ the $t$-ratio

$$\hat\tau_\mu := \hat\phi_1^*/\widehat{\mathrm{SE}}(\hat\phi_1^*), \tag{6.3.5}$$

where $\widehat{\mathrm{SE}}(\hat\phi_1^*)$ is the estimated standard error of $\hat\phi_1^*$, has the same limit distribution
as the test statistic in (6.3.3). The augmented Dickey-Fuller test in this case is applied
in exactly the same manner as for the AR(1) case using the test statistic (6.3.5) and
the cutoff values given above.
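The regression behind (6.3.3) and (6.3.5) can be sketched without any statistical library. The code below is an illustrative implementation under our own naming (`solve_linear`, `adf_t_ratio`, `reject_unit_root`); it uses plain normal equations, which is adequate for short, well-scaled series but is not how a production package would necessarily do it.

```python
import math

# Quantiles quoted in the text for the t-ratio when a mean term is fitted.
DF_CUTOFFS_WITH_MEAN = {0.01: -3.43, 0.05: -2.86, 0.10: -2.57}


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - tail) / m[r][r]
    return sol


def adf_t_ratio(x, p=1):
    """t-ratio of the coefficient of X_{t-1} in the regression (6.3.4)."""
    diffs = [x[t] - x[t - 1] for t in range(1, len(x))]  # diffs[t-1] = X_t - X_{t-1}
    rows, y = [], []
    for t in range(p, len(x)):
        # Regressors: 1, X_{t-1}, and the p - 1 lagged differences.
        rows.append([1.0, x[t - 1]] + [diffs[t - 1 - k] for k in range(1, p)])
        y.append(diffs[t - 1])
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (len(y) - k)   # residual variance estimate
    unit = [1.0 if i == 1 else 0.0 for i in range(k)]
    var_slope = s2 * solve_linear(xtx, unit)[1]     # s2 * [(X'X)^{-1}]_{11}
    return beta[1] / math.sqrt(var_slope)


def reject_unit_root(tau, level=0.05):
    """Compare the t-ratio with the Dickey-Fuller cutoff, not the normal one."""
    return tau < DF_CUTOFFS_WITH_MEAN[level]


tiny_tau = adf_t_ratio([1.0, 2.0, 4.0, 3.0, 5.0], p=1)
```

For $p = 1$ the residual degrees of freedom are $n - 3$, matching the definition of $S^2$ above. The five-point series is only a hand-checkable toy, far too short for the limit distribution to be relevant.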


Consider testing the time series of Example 6.1.1 (see Figure 6.1) for the presence
of a unit root in the autoregressive operator. The sample PACF in Figure 6.3 suggests
fitting an AR(2) or possibly an AR(3) model to the data. Regressing $\nabla X_t$ on
$1, X_{t-1}, \nabla X_{t-1}, \nabla X_{t-2}$ for $t = 4, \ldots, 200$ using OLS gives

$$\nabla X_t = \underset{(.1135)}{.1503} - \underset{(.0028)}{.0041}X_{t-1} + \underset{(.0707)}{.9335}\nabla X_{t-1} - \underset{(.0708)}{.1548}\nabla X_{t-2} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, .9639)$. The test statistic for testing the presence of a unit root
is

$$\hat\tau_\mu = \frac{-.0041}{.0028} = -1.464.$$

Since $-1.464 > -2.57$, the unit root hypothesis is not rejected at level .10. In
contrast, if we had mistakenly used the $t$-distribution with 193 degrees of freedom as
an approximation to $\hat\tau_\mu$, then we would have rejected the unit root hypothesis at the
.10 level (p-value is .074). The $t$-ratios for the other coefficients, $\phi_0^*$, $\phi_2^*$, and $\phi_3^*$, have
an approximate $t$-distribution with 193 degrees of freedom. Based on these $t$-ratios,
the intercept should be 0, while the coefficient of $\nabla X_{t-2}$ is barely significant. The
evidence is much stronger in favor of a unit root if the analysis is repeated without a
mean term. The fitted model without a mean term is

$$\nabla X_t = -\underset{(.0018)}{.0012}X_{t-1} + \underset{(.0707)}{.9395}\nabla X_{t-1} - \underset{(.0709)}{.1585}\nabla X_{t-2} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, .9677)$. The .01, .05, and .10 cutoff values for the corresponding
test statistic when a mean term is excluded from the model are $-2.58$, $-1.95$, and
$-1.62$ (see Table 8.5.2 of Fuller, 1976). In this example, the test statistic is

$$\hat\tau = \frac{-.0012}{.0018} = -.667,$$

which is substantially larger than the .10 cutoff value of $-1.62$.

Further extensions of the above test to AR models with $p = O(n^{1/3})$ and to
ARMA$(p, q)$ models can be found in Said and Dickey (1984). However, as reported
in Schwert (1987) and Pantula (1991), this test must be used with caution if the
underlying model orders are not correctly specified.


6.3.2 Unit Roots in Moving Averages 


A unit root in the moving-average polynomial can have a number of interpretations
depending on the modeling application. For example, let $\{X_t\}$ be a causal and invertible
ARMA$(p, q)$ process satisfying the equations

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Then the differenced series $Y_t := \nabla X_t$ is a noninvertible ARMA$(p, q + 1)$ process
with moving-average polynomial $\theta(z)(1 - z)$. Consequently, testing for a unit root in
the moving-average polynomial is equivalent to testing that the time series has been
overdifferenced.

As a second application, it is possible to distinguish between the competing
models

$$\nabla^k X_t = a + V_t$$

and

$$X_t = c_0 + c_1 t + \cdots + c_k t^k + W_t,$$

where $\{V_t\}$ and $\{W_t\}$ are invertible ARMA processes. For the former model the differenced
series $\{\nabla^k X_t\}$ has no moving-average unit roots, while for the latter model
$\{\nabla^k X_t\}$ has a multiple moving-average unit root of order $k$. We can therefore distinguish
between the two models by using the observed values of $\{\nabla^k X_t\}$ to test for the
presence of a moving-average unit root.
We confine our discussion of unit root tests to first-order moving-average models,
the general case being considerably more complicated and not fully resolved. Let
$X_1, \ldots, X_n$ be observations from the MA(1) model

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{IID}(0, \sigma^2).$$

Davis and Dunsmuir (1996) showed that under the assumption $\theta = -1$, $n(\hat\theta + 1)$ ($\hat\theta$ is
the maximum likelihood estimator) converges in distribution. A test of $H_0: \theta = -1$
vs. $H_1: \theta > -1$ can be fashioned on this limiting result by rejecting $H_0$ when

$$\hat\theta > -1 + c_\alpha/n,$$

where $c_\alpha$ is the $(1 - \alpha)$ quantile of the limit distribution of $n(\hat\theta + 1)$. (From Table
3.2 of Davis, Chen, and Dunsmuir (1995), $c_{.01} = 11.93$, $c_{.05} = 6.80$, and $c_{.10} = 4.90$.)
In particular, if $n = 50$, then the null hypothesis is rejected at level .05 if
$\hat\theta > -1 + 6.80/50 = -.864$.

The likelihood ratio test can also be used for testing the unit root hypothesis. The
likelihood ratio for this problem is $L(-1, S(-1)/n)/L(\hat\theta, \hat\sigma^2)$, where $L(\theta, \sigma^2)$ is
the Gaussian likelihood of the data based on an MA(1) model, $S(-1)$ is the sum of
squares given by (5.2.11) when $\theta = -1$, and $\hat\theta$ and $\hat\sigma^2$ are the maximum likelihood
estimators of $\theta$ and $\sigma^2$. The null hypothesis is rejected at level $\alpha$ if

$$\lambda_n := -2\ln\left(\frac{L(-1, S(-1)/n)}{L(\hat\theta, \hat\sigma^2)}\right) > c_{LR,\alpha},$$

where the cutoff value is chosen such that $P_{\theta=-1}[\lambda_n > c_{LR,\alpha}] = \alpha$. The limit distribution
of $\lambda_n$ was derived by Davis et al. (1995), who also gave selected quantiles
of the limit. It was found that these quantiles provide a good approximation to their
finite-sample counterparts for time series of length $n \ge 50$. The limiting quantiles
for $\lambda_n$ under $H_0$ are $c_{LR,.01} = 4.41$, $c_{LR,.05} = 1.94$, and $c_{LR,.10} = 1.00$.


For the overshort data $\{X_t\}$ of Example 3.2.8, the maximum likelihood MA(1) model
for the mean-corrected data $\{Y_t = X_t + 4.035\}$ was (see Example 5.4.1)

$$Y_t = Z_t - 0.818Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 2040.75).$$

In the structural formulation of this model given in Example 3.2.8, the moving-average
parameter $\theta$ was related to the measurement error variances $\sigma_U^2$ and $\sigma_V^2$ through the
equation

$$\frac{\theta}{1 + \theta^2} = \frac{-\sigma_U^2}{2\sigma_U^2 + \sigma_V^2}.$$

(These error variances correspond to the daily measured amounts of fuel in the tank
and the daily measured adjustments due to sales and deliveries.) A value of $\theta = -1$
indicates that there is no appreciable measurement error due to sales and deliveries
(i.e., $\sigma_V^2 = 0$), and hence testing for a unit root in this case is equivalent to
testing that $\sigma_V^2 = 0$. Assuming that the mean is known, the unit root hypothesis
is rejected at $\alpha = .05$, since $-.818 > -1 + 6.80/57 = -.881$. The evidence
against $H_0$ is stronger using the likelihood ratio statistic. Using ITSM and entering
the MA(1) model $\theta = -1$ and $\sigma^2 = 2203.12$, we find that $-2\ln L(-1, 2203.12) =
604.584$, while $-2\ln L(\hat\theta, \hat\sigma^2) = 597.267$. Comparing the likelihood ratio statistic
$\lambda_n = 604.584 - 597.267 = 7.317$ with the cutoff value $c_{LR,.01}$, we reject $H_0$ at level
$\alpha = .01$ and conclude that the measurement error associated with sales and deliveries
is nonzero.
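The two decision rules used in this example reduce to simple comparisons once $\hat\theta$ and the two values of $-2\ln L$ are known. The sketch below hard-codes the quantiles quoted above; the function names are ours, and the likelihood values are those reported in the text rather than recomputed.

```python
# Limiting quantiles quoted in the text (Davis, Chen, and Dunsmuir, 1995).
MLE_QUANTILES = {0.01: 11.93, 0.05: 6.80, 0.10: 4.90}
LR_QUANTILES = {0.01: 4.41, 0.05: 1.94, 0.10: 1.00}


def reject_by_mle(theta_hat, n, alpha):
    """Reject H0: theta = -1 when theta_hat exceeds -1 + c_alpha / n."""
    return theta_hat > -1.0 + MLE_QUANTILES[alpha] / n


def lr_statistic(neg2_loglik_null, neg2_loglik_mle):
    """lambda_n as the difference of the two -2 ln L values."""
    return neg2_loglik_null - neg2_loglik_mle


def reject_by_lr(lam, alpha):
    """Reject H0 when lambda_n exceeds the limiting quantile c_{LR,alpha}."""
    return lam > LR_QUANTILES[alpha]


# Overshort data: theta_hat = -0.818 from n = 57 observations.
mle_05 = reject_by_mle(-0.818, 57, 0.05)
mle_01 = reject_by_mle(-0.818, 57, 0.01)
lam = lr_statistic(604.584, 597.267)
lr_01 = reject_by_lr(lam, 0.01)
```

The test based on $\hat\theta$ rejects at level .05 but not at level .01, whereas the likelihood ratio test rejects even at level .01, which is the sense in which the evidence against $H_0$ is stronger for the latter.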

In the above example it was assumed that the mean was known. In practice, these 
tests should be adjusted for the fact that the mean is also being estimated. 

Tanaka (1990) proposed a locally best invariant unbiased (LBIU) test for the unit 
root hypothesis. It was found that the LBIU test has slightly greater power than the 
likelihood ratio test for alternatives close to $\theta = -1$ but has less power for alternatives
further away from —1 (see Davis et al., 1995). The LBIU test has been extended to 
cover more general models by Tanaka (1990) and Tam and Reinsel (1995). Similar 
extensions to tests based on the maximum likelihood estimator and the likelihood 
ratio statistic have been explored in Davis, Chen, and Dunsmuir (1996). 


6.4 Forecasting ARIMA Models 


In this section we demonstrate how the methods of Sections 3.3 and 5.4 can be adapted
to forecast the future values of an ARIMA$(p, d, q)$ process $\{X_t\}$. (The required numerical
calculations can all be carried out using the program ITSM.)

If $d \ge 1$, the first and second moments $EX_t$ and $E(X_{t+h}X_t)$ are not determined
by the difference equations (6.1.1). We cannot expect, therefore, to determine best
linear predictors for $\{X_t\}$ without further assumptions.

For example, suppose that $\{Y_t\}$ is a causal ARMA$(p, q)$ process and that $X_0$ is
any random variable. Define

$$X_t = X_0 + \sum_{j=1}^{t} Y_j, \quad t = 1, 2, \ldots.$$

Then $\{X_t, t \ge 0\}$ is an ARIMA$(p, 1, q)$ process with mean $EX_t = EX_0$ and autocovariances
$E(X_{t+h}X_t) - (EX_0)^2$ that depend on $\mathrm{Var}(X_0)$ and $\mathrm{Cov}(X_0, Y_j)$, $j = 1, 2, \ldots$.
The best linear predictor of $X_{n+1}$ based on $\{1, X_0, X_1, \ldots, X_n\}$ is the same
as the best linear predictor in terms of the set $\{1, X_0, Y_1, \ldots, Y_n\}$, since each linear
combination of the latter is a linear combination of the former and vice versa. Hence,
using $P_n$ to denote best linear predictor in terms of either set and using the linearity
of $P_n$, we can write

$$P_nX_{n+1} = P_n(X_0 + Y_1 + \cdots + Y_{n+1}) = P_n(X_n + Y_{n+1}) = X_n + P_nY_{n+1}.$$

To evaluate $P_nY_{n+1}$ it is necessary (see Section 2.5) to know $E(X_0Y_j)$, $j = 1, \ldots, n + 1$,
and $EX_0^2$. However, if we assume that $X_0$ is uncorrelated with $\{Y_t, t \ge 1\}$, then
$P_nY_{n+1}$ is the same (Problem 6.5) as the best linear predictor $\hat Y_{n+1}$ of $Y_{n+1}$ in terms of
$\{1, Y_1, \ldots, Y_n\}$, which can be calculated as described in Section 3.3. The assumption
that $X_0$ is uncorrelated with $Y_1, Y_2, \ldots$ therefore suffices to determine the best linear
predictor $P_nX_{n+1}$ in this case.

Turning now to the general case, we shall assume that our observed process $\{X_t\}$
satisfies the difference equations

$$(1 - B)^d X_t = Y_t, \quad t = 1, 2, \ldots,$$

where $\{Y_t\}$ is a causal ARMA$(p, q)$ process, and that the random vector $(X_{1-d}, \ldots,
X_0)$ is uncorrelated with $Y_t$, $t > 0$. The difference equations can be rewritten in the
form

$$X_t = Y_t - \sum_{j=1}^{d}\binom{d}{j}(-1)^j X_{t-j}, \quad t = 1, 2, \ldots. \tag{6.4.1}$$

It is convenient, by relabeling the time axis if necessary, to assume that we observe
$X_{1-d}, X_{2-d}, \ldots, X_n$. (The observed values of $\{Y_t\}$ are then $Y_1, \ldots, Y_n$.) As usual, we
shall use $P_n$ to denote best linear prediction in terms of the observations up to time $n$
(in this case $1, X_{1-d}, \ldots, X_n$ or equivalently $1, X_{1-d}, \ldots, X_0, Y_1, \ldots, Y_n$).

Our goal is to compute the best linear predictors $P_nX_{n+h}$. This can be done by
applying the operator $P_n$ to each side of (6.4.1) (with $t = n + h$) and using the linearity
of $P_n$ to obtain

$$P_nX_{n+h} = P_nY_{n+h} - \sum_{j=1}^{d}\binom{d}{j}(-1)^j P_nX_{n+h-j}. \tag{6.4.2}$$

Now the assumption that $(X_{1-d}, \ldots, X_0)$ is uncorrelated with $Y_t$, $t > 0$, enables us to
identify $P_nY_{n+h}$ with the best linear predictor of $Y_{n+h}$ in terms of $\{1, Y_1, \ldots, Y_n\}$, and
this can be calculated as described in Section 3.3. The predictor $P_nX_{n+1}$ is obtained
directly from (6.4.2) by noting that $P_nX_{n+1-j} = X_{n+1-j}$ for each $j \ge 1$. The predictor
$P_nX_{n+2}$ can then be found from (6.4.2) using the previously calculated value of
$P_nX_{n+1}$. The predictors $P_nX_{n+3}, P_nX_{n+4}, \ldots$ can be computed recursively in the same
way.
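The recursion just described can be sketched as follows, assuming the predictors $P_nY_{n+1}, P_nY_{n+2}, \ldots$ of the differenced series have already been obtained by the methods of Section 3.3. The function name and argument layout are our own choices for illustration.

```python
import math


def arima_forecasts(x_last, y_forecasts, d):
    """Integrate forecasts of Y = (1 - B)^d X back to forecasts of X via (6.4.2).

    x_last      -- the last d observed values X_{n+1-d}, ..., X_n (oldest first)
    y_forecasts -- P_n Y_{n+1}, P_n Y_{n+2}, ... for the differenced series
    """
    if len(x_last) < d:
        raise ValueError("need at least d observed values of X")
    history = list(x_last)   # observed values, then predictors as they are computed
    forecasts = []
    for y_hat in y_forecasts:
        correction = sum(math.comb(d, j) * (-1) ** j * history[-j]
                         for j in range(1, d + 1))
        value = y_hat - correction
        history.append(value)
        forecasts.append(value)
    return forecasts


once = arima_forecasts([5.0], [1.0, 1.0], d=1)        # X_{n+h} = X_{n+h-1} + Y_{n+h}
twice = arima_forecasts([1.0, 3.0], [0.0, 0.0], d=2)  # X_t = 2X_{t-1} - X_{t-2} + Y_t
```

With $d = 1$ the forecasts simply accumulate the predicted differences, and with $d = 2$ and zero predicted second differences they continue the last observed straight line.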

To find the mean squared error of prediction it is convenient to express $P_nY_{n+h}$
in terms of $\{X_j\}$. For $n \ge 0$ we denote the one-step predictors by $\hat Y_{n+1} = P_nY_{n+1}$ and
$\hat X_{n+1} = P_nX_{n+1}$. Then from (6.4.1) and (6.4.2) we have

$$X_{n+1} - \hat X_{n+1} = Y_{n+1} - \hat Y_{n+1}, \quad n \ge 1,$$

and hence from (3.3.12), if $n > m = \max(p, q)$ and $h \ge 1$, we can write

$$P_nY_{n+h} = \sum_{i=1}^{p}\phi_i P_nY_{n+h-i} + \sum_{j=h}^{q}\theta_{n+h-1,j}\left(X_{n+h-j} - \hat X_{n+h-j}\right). \tag{6.4.3}$$

Setting $\phi^*(z) = (1 - z)^d\phi(z) = 1 - \phi_1^* z - \cdots - \phi_{p+d}^* z^{p+d}$, we find from (6.4.2) and
(6.4.3) that

$$P_nX_{n+h} = \sum_{j=1}^{p+d}\phi_j^* P_nX_{n+h-j} + \sum_{j=h}^{q}\theta_{n+h-1,j}\left(X_{n+h-j} - \hat X_{n+h-j}\right), \tag{6.4.4}$$

which is analogous to the $h$-step prediction formula (3.3.12) for an ARMA process.
As in (3.3.13), the mean squared error of the $h$-step predictor is

$$\sigma_n^2(h) = E(X_{n+h} - P_nX_{n+h})^2 = \sum_{j=0}^{h-1}\left(\sum_{r=0}^{j}\chi_r\theta_{n+h-r-1,j-r}\right)^2 v_{n+h-j-1}, \tag{6.4.5}$$

where $\theta_{n0} = 1$,

$$\chi(z) = \sum_{r=0}^{\infty}\chi_r z^r = \left(1 - \phi_1^* z - \cdots - \phi_{p+d}^* z^{p+d}\right)^{-1},$$

and

$$v_{n+h-j-1} = E\left(X_{n+h-j} - \hat X_{n+h-j}\right)^2 = E\left(Y_{n+h-j} - \hat Y_{n+h-j}\right)^2.$$

The coefficients $\chi_j$ can be found from the recursions (3.3.14) with $\phi_j^*$ replacing $\phi_j$.


For large $n$ we can approximate (6.4.5), provided that $\theta(\cdot)$ is invertible, by

$$\sigma_n^2(h) = \sum_{j=0}^{h-1}\psi_j^2\sigma^2, \tag{6.4.6}$$

where

$$\psi(z) = \sum_{j=0}^{\infty}\psi_j z^j = (\phi^*(z))^{-1}\theta(z).$$
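The coefficients $\psi_j$ in (6.4.6) follow from matching powers of $z$ in $\psi(z)\phi^*(z) = \theta(z)$, which gives $\psi_j = \theta_j + \sum_{k=1}^{\min(j,\,p+d)}\phi_k^*\psi_{j-k}$ with $\theta_0 = 1$ and $\theta_j = 0$ for $j > q$. A minimal sketch, with function names of our own choosing:

```python
def psi_weights(phi_star, theta, count):
    """First `count` coefficients of theta(z) / phi*(z).

    phi_star -- [phi*_1, ..., phi*_{p+d}] in phi*(z) = 1 - phi*_1 z - ...
    theta    -- [theta_1, ..., theta_q] in theta(z) = 1 + theta_1 z + ...
    """
    psi = []
    for j in range(count):
        if j == 0:
            theta_j = 1.0
        elif j <= len(theta):
            theta_j = theta[j - 1]
        else:
            theta_j = 0.0
        carry = sum(phi_star[k - 1] * psi[j - k]
                    for k in range(1, min(j, len(phi_star)) + 1))
        psi.append(theta_j + carry)
    return psi


def large_sample_mse(phi_star, theta, sigma2, h):
    """Approximation (6.4.6) to the h-step mean squared error."""
    return sigma2 * sum(w * w for w in psi_weights(phi_star, theta, h))


# phi*(z) = (1 - z)(1 - 0.4471 z) = 1 - 1.4471 z + 0.4471 z^2, theta(z) = 1.
weights = psi_weights([1.4471, -0.4471], [], 3)
mse_two_step = large_sample_mse([1.4471, -0.4471], [], 0.1455, 2)
```

The usage lines take the coefficients of the ARIMA(1,1,0) model fitted to the Dow Jones data later in this section, for which the two-step value is $0.1455(1 + 1.4471^2) \approx .4502$.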


6.4.1 The Forecast Function 


Inspection of equation (6.4.4) shows that for fixed $n > m = \max(p, q)$, the $h$-step
predictors

$$g(h) := P_nX_{n+h}$$

satisfy the homogeneous linear difference equations

$$g(h) - \phi_1^* g(h - 1) - \cdots - \phi_{p+d}^* g(h - p - d) = 0, \quad h > q, \tag{6.4.7}$$

where $\phi_1^*, \ldots, \phi_{p+d}^*$ are the coefficients of $z, \ldots, z^{p+d}$ in

$$\phi^*(z) = (1 - z)^d\phi(z).$$

The solution of (6.4.7) is well known from the theory of linear difference equations
(see TSTM, Section 3.6). If we assume that the zeros of $\phi(z)$ (denoted by $\xi_1, \ldots, \xi_p$)
are all distinct, then the solution is

$$g(h) = a_0 + a_1 h + \cdots + a_{d-1}h^{d-1} + b_1\xi_1^{-h} + \cdots + b_p\xi_p^{-h}, \quad h > q - p - d, \tag{6.4.8}$$

where the coefficients $a_0, \ldots, a_{d-1}$ and $b_1, \ldots, b_p$ can be determined from the $p + d$
equations obtained by equating the right-hand side of (6.4.8) for $q - p - d < h \le q$
with the corresponding value of $g(h)$ computed numerically (for $h \le 0$, $P_nX_{n+h} =
X_{n+h}$, and for $1 \le h \le q$, $P_nX_{n+h}$ can be computed from (6.4.4) as already described).
Once the constants $a_i$ and $b_i$ have been evaluated, the algebraic expression (6.4.8)
gives the predictors for all $h > q - p - d$. In the case $q = 0$, the values of $g(h)$ in
the equations for $a_0, \ldots, a_{d-1}, b_1, \ldots, b_p$ are simply the observed values $g(h) = X_{n+h}$,
$-p - d < h \le 0$, and the expression (6.4.6) for the mean squared error is exact.

The calculation of the forecast function is easily generalized to deal with more 
complicated ARIMA processes. For example, if the observations X_13, X-12,..., Xn 
are differenced at lags 12 and 1, and (1 — B)(1 — B")x, is modeled as a causal 
invertible ARMA(p, q) process with mean u and max(p, q) < n, then {X,} satisfies 
an equation of the form 


(BÐ — B)( — B?)X, — u] = O(B)Z,, {Z,}~WN(0,07), (6.4.9) 
and the forecast function g(h) = P,,X,+; satisfies the analogue of (6.4.7), namely, 
$(B)(1— B\(1— B”)g(h) =o (un, h >q. (6.4.10) 


To find the general solution of these inhomogeneous linear difference equations, it
suffices (see TSTM, Section 3.6) to find one particular solution of (6.4.10) and then
add to it the general solution of the same equations with the right-hand side set equal
to zero. A particular solution is easily found (by trial and error) to be

\[ g(h) = \frac{\mu h^2}{24}, \]

and the general solution is therefore

\[ g(h) = \frac{\mu h^2}{24} + a_0 + a_1 h + \sum_{j=1}^{11} c_j e^{ij\pi h/6} + b_1\xi_1^{-h} + \cdots + b_p\xi_p^{-h}, \quad h > q - p - 13. \tag{6.4.11} \]

(The terms $a_0$ and $a_1 h$ correspond to the double root $z = 1$ of the equation
$\phi(z)(1 - z)(1 - z^{12}) = 0$, and the subsequent terms to each of the other roots, which we assume
to be distinct.) For $q - p - 13 < h \le 0$, $g(h) = X_{n+h}$, and for $1 \le h \le q$, the values
of $g(h) = P_n X_{n+h}$ can be determined recursively from the equations

\[ P_n X_{n+h} = \mu + P_n X_{n+h-1} + P_n X_{n+h-12} - P_n X_{n+h-13} + P_n Y_{n+h}, \]

where $\{Y_t\}$ is the ARMA process $Y_t = (1 - B)(1 - B^{12})X_t - \mu$. Substituting these
values of $g(h)$ into (6.4.11), we obtain a set of $p + 13$ equations for the coefficients
$a_i$, $b_j$, and $c_k$. Solving these equations then completes the determination of $g(h)$.
The large-sample approximation to the mean squared error is again given by
(6.4.6), with $\psi_j$ redefined as the coefficient of $z^j$ in the power series expansion of

\[ \theta(z)/\left[(1 - z)(1 - z^{12})\phi(z)\right]. \]
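The particular solution quoted above can be checked directly. The following short verification is added for convenience and is not part of the original derivation; it applies the two differencing operators to $h^2$ in turn.

```latex
\begin{align*}
(1 - B)h^2 &= h^2 - (h - 1)^2 = 2h - 1, \\
(1 - B^{12})(2h - 1) &= (2h - 1) - \bigl(2(h - 12) - 1\bigr) = 24, \\
\phi(B)(1 - B)(1 - B^{12})\,\frac{\mu h^2}{24} &= \phi(B)\mu = \phi(1)\mu .
\end{align*}
```

The last step uses the fact that $B$ leaves a constant unchanged, so $\phi(B)$ acts on $\mu$ as multiplication by $\phi(1)$, which is the right-hand side of (6.4.10).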
An ARIMA(1,1,0) model

In Example 5.2.4 we found the maximum likelihood AR(1) model for the mean-corrected
differences $X_t$ of the Dow Jones Utilities Index (Aug. 28–Dec. 18, 1972).
The model was

\[ X_t - 0.4471X_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1455), \tag{6.4.12} \]

where $X_t = D_t - D_{t-1} - 0.1336$, $t = 1, \ldots, 77$, and $\{D_t, t = 0, 1, 2, \ldots, 77\}$ is the
original series. The model for $\{D_t\}$ is thus

\[ (1 - 0.4471B)\left[(1 - B)D_t - 0.1336\right] = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1455). \]

The recursions for $g(h)$ therefore take the form

\[ (1 - 0.4471B)(1 - B)g(h) = 0.5529 \times 0.1336 = 0.07387, \quad h > 0. \tag{6.4.13} \]

A particular solution of these equations is $g(h) = 0.1336h$, so the general solution is

\[ g(h) = 0.1336h + a + b(0.4471)^h, \quad h > -2. \tag{6.4.14} \]

Substituting $g(-1) = D_{76} = 122$ and $g(0) = D_{77} = 121.23$ in the equations with
$h = -1$ and $h = 0$, and solving for $a$ and $b$ gives

\[ g(h) = 0.1336h + 120.50 + 0.7331(0.4471)^h. \]

Setting $h = 1$ and $h = 2$ gives

\[ P_{77}D_{78} = 120.97 \quad \text{and} \quad P_{77}D_{79} = 120.94. \]

From (6.4.5) we find that the corresponding mean squared errors are

\[ \sigma_{77}^2(1) = v_{77} = \sigma^2 = 0.1455 \]

and

\[ \sigma_{77}^2(2) = v_{78} + \phi_1^{*2}v_{77} = \sigma^2\left(1 + 1.4471^2\right) = 0.4502. \]

(Notice that the approximation (6.4.6) is exact in this case.) The predictors and their
mean squared errors are easily obtained from the program ITSM by opening the file
DOWJ.TSM, differencing at lag 1, fitting a preliminary AR(1) model to the mean-corrected
data with Burg's algorithm, and then selecting Model>Estimation>Max
likelihood to find the maximum likelihood AR(1) model. The predicted values and
their mean squared errors are then found using the option Forecasting>ARMA.
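The calculation in this example can be sketched numerically. The code below is an illustrative check, not the ITSM computation: it treats the rounded values quoted in the text ($\phi = 0.4471$, mean $0.1336$, $D_{76} = 122$, $D_{77} = 121.23$) as exact inputs, solves for $a$ and $b$, and compares the closed form with the direct recursion. The function names are hypothetical. Because the inputs are rounded, small differences from the printed predictors are to be expected.

```python
import numpy as np

# Values quoted in the example; treated here as exact inputs (an assumption).
phi, mu, sigma2 = 0.4471, 0.1336, 0.1455
d76, d77 = 122.0, 121.23

# g(h) = mu*h + a + b*phi**h must reproduce g(-1) = D_76 and g(0) = D_77.
A = np.array([[1.0, 1.0 / phi],
              [1.0, 1.0]])
a, b = np.linalg.solve(A, np.array([d76 + mu, d77]))

def g(h):
    """Closed-form forecast function of (6.4.14) with the solved a and b."""
    return mu * h + a + b * phi ** h

def recursive_forecasts(steps):
    """Forecasts from the AR(1) recursion for the mean-corrected differences."""
    prev2, prev1, out = d76, d77, []
    for _ in range(steps):
        nxt = prev1 + mu + phi * (prev1 - prev2 - mu)
        out.append(nxt)
        prev2, prev1 = prev1, nxt
    return out

def mse(h):
    """sigma^2 times the sum of squared psi-weights of 1/((1 - phi z)(1 - z))."""
    psi = np.cumsum(phi ** np.arange(h))   # psi_j = 1 + phi + ... + phi**j
    return sigma2 * np.sum(psi ** 2)

print([round(g(h), 2) for h in (1, 2)], [round(f, 2) for f in recursive_forecasts(2)])
print(round(mse(1), 4), round(mse(2), 4))
```

The closed form and the recursion agree to machine precision, which is a useful sanity check on the solved constants.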


6.5 Seasonal ARIMA Models 


We have already seen how differencing the series $\{X_t\}$ at lag $s$ is a convenient way
of eliminating a seasonal component of period $s$. If we fit an ARMA($p, q$) model
$\phi(B)Y_t = \theta(B)Z_t$ to the differenced series $Y_t = (1 - B^s)X_t$, then the model for the
original series is $\phi(B)(1 - B^s)X_t = \theta(B)Z_t$. This is a special case of the general
seasonal ARIMA (SARIMA) model defined as follows.

Definition 6.5.1

If $d$ and $D$ are nonnegative integers, then $\{X_t\}$ is a seasonal ARIMA$(p, d, q) \times (P, D, Q)_s$
process with period $s$ if the differenced series $Y_t = (1 - B)^d(1 - B^s)^D X_t$
is a causal ARMA process defined by

\[ \phi(B)\Phi(B^s)Y_t = \theta(B)\Theta(B^s)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.5.1} \]

where $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$, $\Phi(z) = 1 - \Phi_1 z - \cdots - \Phi_P z^P$,
$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$, and $\Theta(z) = 1 + \Theta_1 z + \cdots + \Theta_Q z^Q$.

Remark 1. Note that the process $\{Y_t\}$ is causal if and only if $\phi(z) \ne 0$ and $\Phi(z) \ne 0$
for $|z| \le 1$. In applications $D$ is rarely more than one, and $P$ and $Q$ are typically less
than three.


Remark 2. The equation (6.5.1) satisfied by the differenced process $\{Y_t\}$ can be
rewritten in the equivalent form

\[ \phi^*(B)Y_t = \theta^*(B)Z_t, \tag{6.5.2} \]

where $\phi^*(\cdot)$, $\theta^*(\cdot)$ are polynomials of degree $p + sP$ and $q + sQ$, respectively, whose
coefficients can all be expressed in terms of $\phi_1, \ldots, \phi_p$, $\Phi_1, \ldots, \Phi_P$, $\theta_1, \ldots, \theta_q$, and
$\Theta_1, \ldots, \Theta_Q$. Provided that $p < s$ and $q < s$, the constraints on the coefficients of
$\phi^*(\cdot)$ and $\theta^*(\cdot)$ can all be expressed as multiplicative relations

\[ \phi^*_{is+j} = \phi^*_{is}\,\phi^*_j, \quad i = 1, 2, \ldots;\ j = 1, \ldots, s - 1, \]

and

\[ \theta^*_{is+j} = \theta^*_{is}\,\theta^*_j, \quad i = 1, 2, \ldots;\ j = 1, \ldots, s - 1. \]
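To make the multiplicative structure of Remark 2 concrete, the following sketch multiplies out a moving-average operator $\theta(B)\Theta(B^{12})$ numerically. The coefficient values ($\theta_1 = -0.3$, $\Theta_1 = -0.3$) are illustrative, and the helper `seasonal_poly` is a hypothetical name introduced here; polynomials are stored as arrays of ascending coefficients.

```python
import numpy as np

def seasonal_poly(coeffs, s):
    """Ascending coefficients of 1 + c_1 z^s + c_2 z^(2s) + ..."""
    out = np.zeros(len(coeffs) * s + 1)
    out[0] = 1.0
    for i, c in enumerate(coeffs, start=1):
        out[i * s] = c
    return out

s = 12
theta = np.array([1.0, -0.3])            # theta(z) = 1 - 0.3 z (illustrative)
Theta = seasonal_poly([-0.3], s)         # Theta(z^12) = 1 - 0.3 z^12
theta_star = np.convolve(theta, Theta)   # theta*(z) = theta(z) * Theta(z^12)

# Only lags 0, 1, 12 and 13 carry nonzero coefficients.
nonzero = {j: round(float(c), 4) for j, c in enumerate(theta_star) if c != 0.0}
print(nonzero)
```

The coefficient at lag 13 is the product of those at lags 1 and 12, which is the single multiplicative relation needed when $q = Q = 1$ and $s = 12$.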


In Section 1.5 we discussed the classical decomposition model incorporating trend,
seasonality, and random noise, namely, $X_t = m_t + s_t + Y_t$. In modeling real data
it might not be reasonable to assume, as in the classical decomposition model, that
the seasonal component $s_t$ repeats itself precisely in the same way cycle after cycle.
Seasonal ARIMA models allow for randomness in the seasonal pattern from one
cycle to the next.


Example 6.5.1

Suppose we have $r$ years of monthly data, which we tabulate as follows:

Year/Month   1                  2                  ...   12
1            $Y_1$              $Y_2$              ...   $Y_{12}$
2            $Y_{13}$           $Y_{14}$           ...   $Y_{24}$
3            $Y_{25}$           $Y_{26}$           ...   $Y_{36}$
⋮
r            $Y_{1+12(r-1)}$    $Y_{2+12(r-1)}$    ...   $Y_{12+12(r-1)}$

Each column in this table may itself be viewed as a realization of a time series. Suppose
that each one of these twelve time series is generated by the same ARMA($P, Q$)
model, or more specifically, that the series corresponding to the $j$th month, $Y_{j+12t}$,
$t = 0, \ldots, r - 1$, satisfies a difference equation of the form

\[ Y_{j+12t} = \Phi_1 Y_{j+12(t-1)} + \cdots + \Phi_P Y_{j+12(t-P)} + U_{j+12t} + \Theta_1 U_{j+12(t-1)} + \cdots + \Theta_Q U_{j+12(t-Q)}, \tag{6.5.3} \]

where

\[ \{U_{j+12t},\ t = \ldots, -1, 0, 1, \ldots\} \sim \mathrm{WN}(0, \sigma_U^2). \tag{6.5.4} \]

Then since the same ARMA($P, Q$) model is assumed to apply to each month, (6.5.3)
holds for each $j = 1, \ldots, 12$. (Notice, however, that $E(U_t U_{t+h})$ is not necessarily
zero except when $h$ is an integer multiple of 12.) We can thus write (6.5.3) in the
compact form

\[ \Phi(B^{12})Y_t = \Theta(B^{12})U_t, \tag{6.5.5} \]

where $\Phi(z) = 1 - \Phi_1 z - \cdots - \Phi_P z^P$, $\Theta(z) = 1 + \Theta_1 z + \cdots + \Theta_Q z^Q$, and
$\{U_{j+12t},\ t = \ldots, -1, 0, 1, \ldots\} \sim \mathrm{WN}(0, \sigma_U^2)$ for each $j$. We refer to the model (6.5.5) as the
between-year model.


Example 6.5.2

Suppose $P = 0$, $Q = 1$, and $\Theta_1 = -0.4$ in (6.5.5). Then the series for any particular
month is a moving-average of order 1. If $E(U_t U_{t+h}) = 0$ for all $h$, i.e., if the white noise
sequences for different months are uncorrelated with each other, then the columns
themselves are uncorrelated. The correlation function for such a process is shown in
Figure 6.15.


Figure 6-15 The ACF of the model $X_t = U_t - 0.4U_{t-12}$ of Example 6.5.2.

Example 6.5.3

Suppose $P = 1$, $Q = 0$, and $\Phi_1 = 0.7$ in (6.5.5). In this case the 12 series (one for
each month) are AR(1) processes that are uncorrelated if the white noise sequences
for different months are uncorrelated. A graph of the autocorrelation function of this
process is shown in Figure 6.16.
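The autocorrelation functions of these two between-year models have simple closed forms when $\{U_t\}$ is white noise, as assumed in the examples. The sketch below evaluates them; the function names are hypothetical and the formulas are the standard MA(1) and AR(1) autocorrelations placed at multiples of the seasonal lag.

```python
import numpy as np

def seasonal_ma1_acf(Theta, s, max_lag):
    """ACF of X_t = U_t + Theta * U_{t-s}, with {U_t} white noise."""
    rho = np.zeros(max_lag + 1)
    rho[0] = 1.0
    if s <= max_lag:
        rho[s] = Theta / (1.0 + Theta ** 2)
    return rho

def seasonal_ar1_acf(Phi, s, max_lag):
    """ACF of X_t - Phi * X_{t-s} = U_t, with {U_t} white noise and |Phi| < 1."""
    rho = np.zeros(max_lag + 1)
    for k in range(max_lag // s + 1):
        rho[k * s] = Phi ** k
    return rho

acf_ma = seasonal_ma1_acf(-0.4, 12, 60)   # model of Example 6.5.2
acf_ar = seasonal_ar1_acf(0.7, 12, 60)    # model of Example 6.5.3
print(round(float(acf_ma[12]), 4), np.round(acf_ar[[12, 24, 36]], 3))
```

In both cases the autocorrelations vanish at every lag that is not a multiple of 12, which is the pattern the two figures display.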


Figure 6-16 The ACF of the model $X_t - 0.7X_{t-12} = U_t$ of Example 6.5.3.

In each of the Examples 6.5.1, 6.5.2, and 6.5.3, the 12 series corresponding to the
different months are uncorrelated. To incorporate dependence between these series
we allow the process $\{U_t\}$ in (6.5.5) to follow an ARMA($p, q$) model,

\[ \phi(B)U_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{6.5.6} \]


This assumption implies possible nonzero correlation not only between consecutive
values of $U_t$, but also within the twelve sequences $\{U_{j+12t},\ t = \ldots, -1, 0, 1, \ldots\}$,
each of which was assumed to be uncorrelated in the preceding examples. In this
case (6.5.4) may no longer hold; however, the coefficients in (6.5.6) will frequently
have values such that $E(U_t U_{t+12j})$ is small for $j = \pm 1, \pm 2, \ldots$. Combining the two
models (6.5.5) and (6.5.6) and allowing for possible differencing leads directly to
Definition 6.5.1 of the general SARIMA model as given above.

The first steps in identifying SARIMA models for a (possibly transformed) data
set are to find $d$ and $D$ so as to make the differenced observations

\[ Y_t = (1 - B)^d(1 - B^s)^D X_t \]

stationary in appearance (see Sections 6.1–6.3). Next we examine the sample ACF
and PACF of $\{Y_t\}$ at lags that are multiples of $s$ for an indication of the orders $P$ and
$Q$ in the model (6.5.5). If $\hat\rho(\cdot)$ is the sample ACF of $\{Y_t\}$, then $P$ and $Q$ should be
chosen such that $\hat\rho(ks)$, $k = 1, 2, \ldots$, is compatible with the ACF of an ARMA($P, Q$)
process. The orders $p$ and $q$ are then selected by trying to match $\hat\rho(1), \ldots, \hat\rho(s - 1)$
with the ACF of an ARMA($p, q$) process. Ultimately, the AICC criterion (Section 5.5)
and the goodness of fit tests (Section 5.3) are used to select the best SARIMA model
from competing alternatives.
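The identification steps above can be sketched in a few lines. This is an illustration on simulated data rather than ITSM output: `difference` and `sample_acf` are hypothetical helper names, and the simulated series follows $Y_t = (1 - 0.5B)(1 - 0.6B^{12})Z_t$, chosen only so that the expected pattern at lags 1, 11, 12 and 13 is known in advance.

```python
import numpy as np

def difference(x, lag=1):
    """Apply (1 - B^lag) to x, dropping the first `lag` values."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]

def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat(0), ..., rho_hat(max_lag), 1/n scaling."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    dev = y - y.mean()
    gamma0 = dev @ dev / n
    return np.array([dev[h:] @ dev[:n - h] / n / gamma0 for h in range(max_lag + 1)])

# Differencing at lags 12 and 1 removes a linear trend plus a period-12 pattern.
t = np.arange(200)
deterministic = 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 12)
removed = difference(difference(deterministic, 12), 1)

# Simulated differenced series with known multiplicative MA structure.
rng = np.random.default_rng(0)
n = 3000
z = rng.normal(size=n + 13)
y = z[13:] - 0.5 * z[12:-1] - 0.6 * z[1:-12] + 0.3 * z[:-13]

rho = sample_acf(y, 40)
print(np.round(rho[[1, 11, 12, 13, 24]], 3))
```

Reading the seasonal lags (12, 24, ...) first suggests $Q = 1$, and the short lags then suggest $q = 1$, mirroring the order in which the text selects $(P, Q)$ and then $(p, q)$.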

For given values of $p$, $d$, $q$, $P$, $D$, and $Q$, the parameters $\phi$, $\theta$, $\Phi$, $\Theta$, and $\sigma^2$ can
be found using the maximum likelihood procedure of Section 5.2. The differences
$Y_t = (1 - B)^d(1 - B^s)^D X_t$ constitute an ARMA($p + sP, q + sQ$) process in which
some of the coefficients are zero and the rest are functions of the $(p + P + q + Q)$-dimensional
vector $\beta' = (\phi', \Phi', \theta', \Theta')$. For any fixed $\beta$ the reduced likelihood $\ell(\beta)$
of the differences $Y_{1+d+sD}, \ldots, Y_n$ is easily computed as described in Section 5.2. The
maximum likelihood estimator of $\beta$ is the value that minimizes $\ell(\beta)$, and the maximum
likelihood estimate of $\sigma^2$ is given by (5.2.10). The estimates can be found using
the program ITSM by specifying the required multiplicative relationships among the
coefficients as given in Remark 2 above.

A more direct approach to modeling the differenced series $\{Y_t\}$ is simply to fit
a subset ARMA model of the form (6.5.2) without making use of the multiplicative
form of $\phi^*(\cdot)$ and $\theta^*(\cdot)$ in (6.5.1).


Example 6.5.4 Monthly accidental deaths

In Figure 1.27 we showed the series $\{Y_t = (1 - B^{12})(1 - B)X_t\}$ obtained by differencing
the accidental deaths series $\{X_t\}$ once at lag 12 and once at lag 1. The sample
ACF of $\{Y_t\}$ is shown in Figure 6.17.

Figure 6-17 The sample ACF of the differenced accidental deaths $\{\nabla\nabla_{12}X_t\}$.

The values $\hat\rho(12) = -0.333$, $\hat\rho(24) = -0.099$, and $\hat\rho(36) = 0.013$ suggest a
moving-average of order 1 for the between-year model (i.e., $P = 0$ and $Q = 1$).
Moreover, inspection of $\hat\rho(1), \ldots, \hat\rho(11)$ suggests that $\rho(1)$ is the only short-term
correlation different from zero, so we also choose a moving-average of order 1 for
the between-month model (i.e., $p = 0$ and $q = 1$). Taking into account the sample
mean (28.831) of the differences $\{Y_t\}$, we therefore arrive at the model

\[ Y_t = 28.831 + (1 + \theta_1 B)(1 + \Theta_1 B^{12})Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.5.7} \]


for the series $\{Y_t\}$. The maximum likelihood estimates of the parameters are obtained
from ITSM by opening the file DEATHS.TSM and proceeding as follows. After
differencing (at lags 1 and 12) and then mean-correcting the data, choose the option
Model>Specify. In the dialog box enter an MA(13) model with $\theta_1 = -0.3$,
$\theta_{12} = -0.3$, $\theta_{13} = 0.09$, and all other coefficients zero. (This corresponds to the
initial guess $Y_t = (1 - 0.3B)(1 - 0.3B^{12})Z_t$.) Then choose Model>Estimation>Max
likelihood and click on the button Constrain optimization. Specify the number
of multiplicative relations (one in this case) in the box provided and define the
relationship by entering 1, 12, 13 to indicate that $\theta_1 \times \theta_{12} = \theta_{13}$. Click OK to return
to the Maximum Likelihood dialog box. Click OK again to obtain the parameter
estimates

\[ \hat\theta_1 = -0.478, \quad \hat\Theta_1 = -0.591, \quad \text{and} \quad \hat\sigma^2 = 94255, \]

with AICC value 855.53. The corresponding fitted model for $\{X_t\}$ is thus the SARIMA
$(0, 1, 1) \times (0, 1, 1)_{12}$ process

\[ \nabla\nabla_{12}X_t = 28.831 + (1 - 0.478B)(1 - 0.588B^{12})Z_t, \tag{6.5.8} \]

where $\{Z_t\} \sim \mathrm{WN}(0, 94390)$.

If we adopt the alternative approach of fitting a subset ARMA model to $\{Y_t\}$
without seeking a multiplicative structure for the operators $\phi^*(B)$ and $\theta^*(B)$ in (6.5.2),
we begin by fitting a preliminary MA(13) model (as suggested by Figure 6.17) to
the series $\{Y_t\}$. We then fit a maximum likelihood MA(13) model and examine the
standard errors of the coefficient estimators. This suggests setting the coefficients at
lags 2, 3, 8, 10, and 11 equal to zero, since these are all less than one standard error
from zero. To do this select Model>Estimation>Max likelihood and click on the
button Constrain optimization. Then highlight the coefficients to be set to zero
and click on the button Set to zero. Click OK to return to the Maximum Likelihood
Estimation dialog box and again to carry out the constrained optimization. The
coefficients that have been set to zero will be held at that value, and the optimization
will be with respect to the remaining coefficients. This gives a model with substantially
smaller AICC than the unconstrained MA(13) model. Examining the standard errors
again we see that the coefficients at lags 4, 5, and 7 are promising candidates to be
set to zero, since each of them is less than one standard error from zero. Setting these
coefficients to zero in the same way and reoptimizing gives a further reduction in
AICC. Setting the coefficient at lag 9 to zero and reoptimizing again gives a further
reduction in AICC (to 855.61) and the fitted model

\[ \nabla\nabla_{12}X_t = 28.831 + Z_t - 0.596Z_{t-1} - 0.407Z_{t-6} - 0.685Z_{t-12} + 0.460Z_{t-13}, \quad \{Z_t\} \sim \mathrm{WN}(0, 71240). \tag{6.5.9} \]

The AICC value 855.61 is quite close to the value 855.53 for the model (6.5.8). The
residuals from the two models are also very similar, the randomness tests (with the
exception of the difference-sign test) yielding high p-values for both.


6.5.1 Forecasting SARIMA Processes 


Forecasting SARIMA processes is completely analogous to the forecasting of ARIMA
processes discussed in Section 6.4. Expanding out the operator $(1 - B)^d(1 - B^s)^D$
in powers of $B$, rearranging the equation

\[ (1 - B)^d(1 - B^s)^D X_t = Y_t, \]

and setting $t = n + h$ gives the analogue

\[ X_{n+h} = Y_{n+h} + \sum_{j=1}^{d+Ds} a_j X_{n+h-j} \tag{6.5.10} \]

of equation (6.4.2). Under the assumption that the first $d + Ds$ observations $X_{-d-Ds+1}, \ldots, X_0$
are uncorrelated with $\{Y_t, t \ge 1\}$, we can determine the best linear predictors
$P_n X_{n+h}$ of $X_{n+h}$ based on $\{1, X_{-d-Ds+1}, \ldots, X_n\}$ by applying $P_n$ to each side of
(6.5.10) to obtain

\[ P_n X_{n+h} = P_n Y_{n+h} + \sum_{j=1}^{d+Ds} a_j P_n X_{n+h-j}. \tag{6.5.11} \]

The first term on the right is just the best linear predictor of the (possibly nonzero-mean)
ARMA process $\{Y_t\}$ in terms of $\{1, Y_1, \ldots, Y_n\}$, which can be calculated as
described in Section 3.3. The predictors $P_n X_{n+h}$ can then be computed recursively
for $h = 1, 2, \ldots$ from (6.5.11), if we note that $P_n X_{n+1-j} = X_{n+1-j}$ for each $j \ge 1$.
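As a worked instance of (6.5.10) and (6.5.11), added here for illustration, take $d = D = 1$ and $s = 12$, the differencing used for the accidental deaths series. Expanding the operator gives the coefficients $a_j$ explicitly:

```latex
\begin{align*}
(1 - B)(1 - B^{12}) &= 1 - B - B^{12} + B^{13}, \\
X_{n+h} &= Y_{n+h} + X_{n+h-1} + X_{n+h-12} - X_{n+h-13}, \\
P_n X_{n+h} &= P_n Y_{n+h} + P_n X_{n+h-1} + P_n X_{n+h-12} - P_n X_{n+h-13}.
\end{align*}
```

Here $a_1 = a_{12} = 1$, $a_{13} = -1$, and all other $a_j$ are zero, so each predictor needs only three earlier values (observed or predicted) together with the ARMA forecast $P_n Y_{n+h}$.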

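An argument analogous to the one leading to (6.4.5) gives the prediction mean
squared error as

\[ \sigma_n^2(h) = E\left(X_{n+h} - P_n X_{n+h}\right)^2 = \sum_{j=0}^{h-1}\left(\sum_{r=0}^{j} \chi_r\,\theta_{n+h-r-1,\,j-r}\right)^2 v_{n+h-j-1}, \tag{6.5.12} \]

where $\theta_{nj}$ and $v_n$ are obtained by applying the innovations algorithm to the differenced
series $\{Y_t\}$ and

\[ \chi(z) = \sum_{r=0}^{\infty} \chi_r z^r = \left[\phi(z)\Phi(z^s)(1 - z)^d(1 - z^s)^D\right]^{-1}, \quad |z| < 1. \]

For large $n$ we can approximate (6.5.12), if $\theta(z)\Theta(z^s)$ is nonzero for all $|z| \le 1$, by

\[ \sigma_n^2(h) = \sum_{j=0}^{h-1} \psi_j^2\sigma^2, \tag{6.5.13} \]

where

\[ \psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)\Theta(z^s)}{\phi(z)\Phi(z^s)(1 - z)^d(1 - z^s)^D}, \quad |z| < 1. \]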

The required calculations can all be carried out with the aid of the program ITSM.
The mean squared errors are computed from the large-sample approximation (6.5.13)
if the fitted model is invertible. If the fitted model is not invertible, ITSM computes the
mean squared errors by converting the model to the equivalent (in terms of Gaussian
likelihood) invertible model and then using (6.5.13).

Table 6.1 Predicted values of the Accidental Deaths series for $t = 73, \ldots, 78$,
the standard deviations $\sigma_t$ of the prediction errors,
and the corresponding observed values of $X_t$ for the same
period.

t                  73     74     75     76     77     78
Model (6.5.8)
  Predictors       8441   7704   8549   8885   9843   10279
  $\sigma_t$       308    348    383    415    445    474
Model (6.5.9)
  Predictors       8345   7619   8356   8742   9795   10179
  $\sigma_t$       292    329    366    403    442    486
Observed values
  $X_t$            7798   7406   8363   8460   9217   9316
Example 6.5.5 Monthly accidental deaths 


Continuing with Example 6.5.4, we next use ITSM to predict six future values of 
the Accidental Deaths series using the fitted models (6.5.8) and (6.5.9). First fit the 
desired model as described in Example 6.5.4 or enter the data and model directly 
by opening the file DEATHS.TSM, differencing at lags 12 and 1, subtracting the 
mean, and then entering the MA(13) coefficients and white noise variance using 
the option Model>Specify. Select Forecasting>ARMA, and you will see the ARMA 
Forecast dialog box. Enter 6 for the number of predicted values required. You will 
notice that the default options in the dialog box are set to generate predictors of 
the original series by reversing the transformations applied to the data. If for some 
reason you wish to predict the transformed data, these check marks can be removed. 
If you wish to include prediction bounds in the graph of the predictors, check the 
appropriate box and specify the desired coefficient (e.g., 95%). Click OK, and you 
will see a graph of the data with the six predicted values appended. For numerical 
values of the predictors and prediction bounds, right-click on the graph and then on 
Info. The prediction bounds are computed under the assumption that the white noise 
sequence in the ARMA model for the transformed data is Gaussian. Table 6.1 shows 
the predictors and standard deviations of the prediction errors under both models 
(6.5.8) and (6.5.9) for the Accidental Deaths series. 
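The standard deviations reported for model (6.5.8) can be approximated from the large-sample formula (6.5.13). The sketch below is an independent check, not the ITSM routine: it expands $\psi(z)$ as a power series using the coefficients and white noise variance quoted in (6.5.8). The helper names `power_series_ratio` and `lag_poly` are hypothetical. The results come out close to the $\sigma_t$ row of Table 6.1 for model (6.5.8), though exact agreement should not be expected.

```python
import numpy as np

def power_series_ratio(num, den, n_terms):
    """First n_terms coefficients of num(z)/den(z); den[0] must equal 1."""
    psi = np.zeros(n_terms)
    for j in range(n_terms):
        acc = num[j] if j < len(num) else 0.0
        for k in range(1, min(j, len(den) - 1) + 1):
            acc -= den[k] * psi[j - k]
        psi[j] = acc
    return psi

def lag_poly(coef, lag):
    """Ascending coefficients of 1 + coef * z^lag."""
    p = np.zeros(lag + 1)
    p[0], p[lag] = 1.0, coef
    return p

# Model (6.5.8): numerator (1 - 0.478 z)(1 - 0.588 z^12),
# denominator (1 - z)(1 - z^12) from differencing at lags 1 and 12.
num = np.convolve(lag_poly(-0.478, 1), lag_poly(-0.588, 12))
den = np.convolve(lag_poly(-1.0, 1), lag_poly(-1.0, 12))
sigma2 = 94390.0

psi = power_series_ratio(num, den, 6)
std = np.sqrt(sigma2 * np.cumsum(psi ** 2))   # sigma_n(h) for h = 1, ..., 6
print(np.round(std))
```

Because the first twelve $\psi$-weights after $\psi_0$ are all equal to $1 + \theta_1$, the prediction variance grows linearly in $h$ over the first year of forecasts.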


6.6 Regression with ARMA Errors 


6.6.1 OLS and GLS Estimation 


In standard linear regression, the errors (or deviations of the observations from the
regression function) are assumed to be independent and identically distributed. In
many applications of regression analysis, however, this assumption is clearly violated,
as can be seen by examination of the residuals from the fitted regression and
their sample autocorrelations. It is often more appropriate to assume that the errors
are observations of a zero-mean second-order stationary process. Since many autocorrelation
functions can be well approximated by the autocorrelation function of a
suitably chosen ARMA($p, q$) process, it is of particular interest to consider the model

\[ Y_t = x_t'\beta + W_t, \quad t = 1, \ldots, n, \tag{6.6.1} \]

or in matrix notation,

\[ Y = X\beta + W, \tag{6.6.2} \]

where $Y = (Y_1, \ldots, Y_n)'$ is the vector of observations at times $t = 1, \ldots, n$, $X$
is the design matrix whose $t$th row, $x_t' = (x_{t1}, \ldots, x_{tk})$, consists of the values of
the explanatory variables at time $t$, $\beta = (\beta_1, \ldots, \beta_k)'$ is the vector of regression
coefficients, and the components of $W = (W_1, \ldots, W_n)'$ are values of a causal zero-mean
ARMA($p, q$) process satisfying

\[ \phi(B)W_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{6.6.3} \]

The model (6.6.1) arises naturally in trend estimation for time series data. For
example, the explanatory variables $x_{t1} = 1$, $x_{t2} = t$, and $x_{t3} = t^2$ can be used to
estimate a quadratic trend, and the variables $x_{t1} = 1$, $x_{t2} = \cos(\omega t)$, and $x_{t3} = \sin(\omega t)$
can be used to estimate a sinusoidal trend with frequency $\omega$. The columns of $X$ are
not necessarily simple functions of $t$ as in these two examples. Any specified column
of relevant variables, e.g., temperatures at times $t = 1, \ldots, n$, can be included in the
design matrix $X$, in which case the regression is conditional on the observed values
of the variables included in the matrix.

The ordinary least squares (OLS) estimator of $\beta$ is the value, $\hat\beta_{\mathrm{OLS}}$, which
minimizes the sum of squares

\[ (Y - X\beta)'(Y - X\beta) = \sum_{t=1}^{n}\left(Y_t - x_t'\beta\right)^2. \]

Equating to zero the partial derivatives with respect to each component of $\beta$ and
assuming (as we shall) that $X'X$ is nonsingular, we find that

\[ \hat\beta_{\mathrm{OLS}} = (X'X)^{-1}X'Y. \tag{6.6.4} \]

(If $X'X$ is singular, $\hat\beta_{\mathrm{OLS}}$ is not uniquely determined but still satisfies (6.6.4) with
$(X'X)^{-1}$ any generalized inverse of $X'X$.) The OLS estimate also maximizes the
likelihood of the observations when the errors $W_1, \ldots, W_n$ are iid and Gaussian. If
the design matrix $X$ is nonrandom, then even when the errors are non-Gaussian and
dependent, the OLS estimator is unbiased (i.e., $E(\hat\beta_{\mathrm{OLS}}) = \beta$) and its covariance
matrix is

\[ \mathrm{Cov}(\hat\beta_{\mathrm{OLS}}) = (X'X)^{-1}X'\Gamma_n X(X'X)^{-1}, \tag{6.6.5} \]

where $\Gamma_n = E(WW')$ is the covariance matrix of $W$.


The generalized least squares (GLS) estimator of $\beta$ is the value $\hat\beta_{\mathrm{GLS}}$ that
minimizes the weighted sum of squares

\[ (Y - X\beta)'\Gamma_n^{-1}(Y - X\beta). \tag{6.6.6} \]

Differentiating partially with respect to each component of $\beta$ and setting the derivatives
equal to zero, we find that

\[ \hat\beta_{\mathrm{GLS}} = \left(X'\Gamma_n^{-1}X\right)^{-1}X'\Gamma_n^{-1}Y. \tag{6.6.7} \]

If the design matrix $X$ is nonrandom, the GLS estimator is unbiased and has covariance
matrix

\[ \mathrm{Cov}(\hat\beta_{\mathrm{GLS}}) = \left(X'\Gamma_n^{-1}X\right)^{-1}. \tag{6.6.8} \]

It can be shown that the GLS estimator is the best linear unbiased estimator of $\beta$, i.e.,
for any $k$-dimensional vector $c$ and for any unbiased estimator $\hat\beta$ of $\beta$ that is a linear
function of the observations $Y_1, \ldots, Y_n$,

\[ \mathrm{Var}\left(c'\hat\beta_{\mathrm{GLS}}\right) \le \mathrm{Var}\left(c'\hat\beta\right). \]

In this sense the GLS estimator is therefore superior to the OLS estimator. However,
it can be computed only if $\phi$ and $\theta$ are known.
Let $V(\phi, \theta)$ denote the matrix $\sigma^{-2}\Gamma_n$, and let $T(\phi, \theta)$ be any square root of $V^{-1}$
(i.e., a matrix such that $T'T = V^{-1}$). Then we can multiply each side of (6.6.2) by $T$
to obtain

\[ TY = TX\beta + TW, \tag{6.6.9} \]

a regression equation with coefficient vector $\beta$, data vector $TY$, design matrix $TX$,
and error vector $TW$. Since the latter has uncorrelated, zero-mean components, each
with variance $\sigma^2$, the best linear estimator of $\beta$ in terms of $TY$ (which is clearly the
same as the best linear estimator of $\beta$ in terms of $Y$, i.e., $\hat\beta_{\mathrm{GLS}}$) can be obtained by
applying OLS estimation to the transformed regression equation (6.6.9). This gives

\[ \hat\beta_{\mathrm{GLS}} = (X'T'TX)^{-1}X'T'TY, \tag{6.6.10} \]


which is clearly the same as (6.6.7). Cochrane and Orcutt (1949) pointed out that if
$\{W_t\}$ is an AR($p$) process satisfying

\[ \phi(B)W_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \]

then application of $\phi(B)$ to each side of the regression equations (6.6.1) transforms
them into regression equations with uncorrelated, zero-mean, constant-variance errors,
so that ordinary least squares can again be used to compute best linear unbiased
estimates of the components of $\beta$ in terms of $Y_t^* = \phi(B)Y_t$, $t = p + 1, \ldots, n$.
This approach eliminates the need to compute the matrix $T$ but suffers from the
drawback that $Y^*$ does not contain all the information in $Y$. Cochrane and Orcutt's
transformation can be improved, and at the same time generalized to ARMA errors, as
follows.

Instead of applying the operator $\phi(B)$ to each side of the regression equations
(6.6.1), we multiply each side of equation (6.6.2) by the matrix $T(\phi, \theta)$ that maps
$\{W_t\}$ into the residuals (see (5.3.1)) of $\{W_t\}$ from the ARMA model (6.6.3). We
have already seen how to calculate these residuals using the innovations algorithm in
Section 3.3. To see that $T$ is a square root of the matrix $V^{-1}$ as defined in the previous
paragraph, we simply recall that the residuals are uncorrelated with zero mean and
variance $\sigma^2$, so that

\[ \mathrm{Cov}(TW) = T\Gamma_n T' = \sigma^2 I, \]

where $I$ is the $n \times n$ identity matrix. Hence

\[ T'T = \sigma^2\Gamma_n^{-1} = V^{-1}. \]

GLS estimation of $\beta$ can therefore be carried out by multiplying each side of (6.6.2)
by $T$ and applying ordinary least squares to the transformed regression model. It
remains only to compute $TY$ and $TX$.

Any data vector $d = (d_1, \ldots, d_n)'$ can be left-multiplied by $T$ simply by reading
it into ITSM, entering the model (6.6.3), and pressing the green button labeled RES,
which plots the residuals. (The calculations are performed using the innovations
algorithm as described in Section 3.3.) The GLS estimator $\hat\beta_{\mathrm{GLS}}$ is computed as
follows. The data vector $Y$ is left-multiplied by $T$ to generate the transformed data
vector $Y^*$, and each column of the design matrix $X$ is left-multiplied by $T$ to generate
the corresponding column of the transformed design matrix $X^*$. Then

\[ \hat\beta_{\mathrm{GLS}} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}Y^*. \tag{6.6.11} \]

The calculations of $Y^*$, $X^*$, and hence of $\hat\beta_{\mathrm{GLS}}$, are all carried out by the program
ITSM in the option Regression>Estimation>Generalized LS.
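The equivalence between the direct formula (6.6.7) and OLS on the transformed equation (6.6.9) can be checked numerically. The sketch below is not the ITSM procedure: it assumes AR(1) errors with illustrative parameters and builds one convenient square root, $T = \sigma L^{-1}$ with $L$ the Cholesky factor of $\Gamma_n$, rather than using the innovations algorithm. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, phi, sigma2 = 60, 0.6, 1.0

# Covariance matrix of a causal AR(1) error process (illustrative choice).
idx = np.arange(n)
Gamma = sigma2 * phi ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - phi ** 2)

# Design matrix for a linear trend, and simulated observations.
X = np.column_stack([np.ones(n), idx + 1.0])
L = np.linalg.cholesky(Gamma)
Y = X @ np.array([10.0, -0.02]) + L @ rng.normal(size=n)

# Direct GLS, equation (6.6.7).
Gi_X = np.linalg.solve(Gamma, X)
beta_direct = np.linalg.solve(X.T @ Gi_X, Gi_X.T @ Y)

# Whitening: T = sigma * L^{-1} satisfies T Gamma T' = sigma^2 I.
T = np.sqrt(sigma2) * np.linalg.inv(L)
beta_whitened, *_ = np.linalg.lstsq(T @ X, T @ Y, rcond=None)

# Covariance matrices (6.6.8) and (6.6.5) under the same Gamma.
cov_gls = np.linalg.inv(X.T @ Gi_X)
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ X.T @ Gamma @ X @ XtX_inv
print(np.round(beta_direct, 4), np.round(beta_whitened, 4))
```

Any other matrix $T$ with $T'T = \sigma^2\Gamma_n^{-1}$ would give the same estimate; the Cholesky choice is used here only because it is readily available.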


6.6.2 ML Estimation 


If (as is usually the case) the parameters of the ARMA($p, q$) model for the errors
are unknown, they can be estimated together with the regression coefficients by
maximizing the Gaussian likelihood

\[ L(\beta, \phi, \theta, \sigma^2) = (2\pi)^{-n/2}(\det\Gamma_n)^{-1/2}\exp\left\{-\tfrac{1}{2}(Y - X\beta)'\Gamma_n^{-1}(Y - X\beta)\right\}, \]

where $\Gamma_n(\phi, \theta, \sigma^2)$ is the covariance matrix of $W = Y - X\beta$. Since $\{W_t\}$ is an
ARMA($p, q$) process with parameters $(\phi, \theta, \sigma^2)$, the maximum likelihood estimators
$\hat\beta$, $\hat\phi$, and $\hat\theta$ are found (as in Section 5.2) by minimizing

\[ \ell(\beta, \phi, \theta) = \ln\left(n^{-1}S(\beta, \phi, \theta)\right) + n^{-1}\sum_{t=1}^{n}\ln r_{t-1}, \tag{6.6.12} \]

where

\[ S(\beta, \phi, \theta) = \sum_{t=1}^{n}\left(W_t - \hat W_t\right)^2 / r_{t-1}, \]

$\hat W_t$ is the best one-step predictor of $W_t$, and $r_{t-1}\sigma^2$ is its mean squared error. The
function $\ell(\beta, \phi, \theta)$ can be expressed in terms of the observations $\{Y_t\}$ and the parameters
$\beta$, $\phi$, and $\theta$ using the innovations algorithm (see Section 3.3) and minimized
numerically to give the maximum likelihood estimators, $\hat\beta$, $\hat\phi$, and $\hat\theta$. The maximum
likelihood estimator of $\sigma^2$ is then given, as in Section 5.2, by $\hat\sigma^2 = S(\hat\beta, \hat\phi, \hat\theta)/n$.
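The relation between the Gaussian likelihood and the reduced likelihood (6.6.12) can be checked in a simple special case. The sketch below assumes AR(1) errors, for which the one-step predictors have a closed form ($\hat W_1 = 0$, $r_0 = 1/(1 - \phi^2)$, and $\hat W_t = \phi W_{t-1}$, $r_{t-1} = 1$ for $t \ge 2$), so the innovations algorithm is not needed. The data and function names are hypothetical.

```python
import numpy as np

def reduced_likelihood_ar1(beta, phi, X, Y):
    """l(beta, phi) of (6.6.12) and S(beta, phi) when {W_t} is a causal AR(1)."""
    w = Y - X @ beta
    n = len(w)
    w_hat = np.concatenate([[0.0], phi * w[:-1]])           # one-step predictors
    r = np.concatenate([[1.0 / (1.0 - phi ** 2)], np.ones(n - 1)])
    S = np.sum((w - w_hat) ** 2 / r)
    return np.log(S / n) + np.mean(np.log(r)), S

def neg2_loglik(beta, phi, sigma2, X, Y):
    """-2 ln L computed directly from the covariance matrix Gamma_n."""
    w = Y - X @ beta
    n = len(w)
    idx = np.arange(n)
    Gamma = sigma2 * phi ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - phi ** 2)
    _, logdet = np.linalg.slogdet(Gamma)
    return n * np.log(2 * np.pi) + logdet + w @ np.linalg.solve(Gamma, w)

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), np.arange(1, n + 1)])
Y = 5.0 + 0.3 * rng.normal(size=n).cumsum()                 # arbitrary test data
beta, phi = np.array([5.0, 0.0]), 0.5

ell, S = reduced_likelihood_ar1(beta, phi, X, Y)
sigma2_hat = S / n
direct = neg2_loglik(beta, phi, sigma2_hat, X, Y)
print(round(direct, 6), round(n * (np.log(2 * np.pi) + 1.0 + ell), 6))
```

With $\sigma^2$ replaced by $S/n$, $-2\ln L$ equals $n(\ln 2\pi + 1 + \ell)$, so minimizing $\ell$ over the remaining parameters is equivalent to maximizing the likelihood.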

An extension of an iterative scheme, proposed by Cochrane and Orcutt (1949) for
the case $q = 0$, simplifies the minimization considerably. It is based on the observation
that for fixed $\phi$ and $\theta$, the value of $\beta$ that minimizes $\ell(\beta, \phi, \theta)$ is $\hat\beta_{\mathrm{GLS}}(\phi, \theta)$, which
can be computed algebraically from (6.6.11) instead of by searching numerically for
the minimizing value. The scheme is as follows.

(i) Compute $\hat\beta_{\mathrm{OLS}}$ and the estimated residuals $Y_t - x_t'\hat\beta_{\mathrm{OLS}}$, $t = 1, \ldots, n$.

(ii) Fit an ARMA($p, q$) model by maximum Gaussian likelihood to the estimated
residuals.

(iii) For the fitted ARMA model compute the corresponding estimator $\hat\beta_{\mathrm{GLS}}$ from
(6.6.11).

(iv) Compute the residuals $Y_t - x_t'\hat\beta_{\mathrm{GLS}}$, $t = 1, \ldots, n$, and return to (ii), stopping
when the estimators have stabilized.
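Steps (i) to (iv) can be sketched for the simplest case of AR(1) errors. This is a simplified illustration, not the scheme as implemented in ITSM: in step (ii) the maximum Gaussian likelihood fit is replaced by the lag-one sample autocorrelation of the residuals (a Yule-Walker type estimate), and the data are simulated. Function names are hypothetical.

```python
import numpy as np

def ar1_gamma(phi, n):
    """AR(1) covariance matrix with unit white-noise variance."""
    idx = np.arange(n)
    return phi ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - phi ** 2)

def gls(X, Y, Gamma):
    """GLS estimator (X' Gamma^{-1} X)^{-1} X' Gamma^{-1} Y."""
    Gi_X = np.linalg.solve(Gamma, X)
    return np.linalg.solve(X.T @ Gi_X, Gi_X.T @ Y)

def iterate_regression_ar1(X, Y, max_iter=50, tol=1e-10):
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]            # step (i): OLS
    phi = 0.0
    for _ in range(max_iter):
        w = Y - X @ beta                                    # current residuals
        phi_new = (w[1:] @ w[:-1]) / (w @ w)                # step (ii), simplified fit
        beta_new = gls(X, Y, ar1_gamma(phi_new, len(Y)))    # step (iii): GLS
        done = abs(phi_new - phi) < tol and np.max(np.abs(beta_new - beta)) < tol
        beta, phi = beta_new, phi_new                       # step (iv): update, repeat
        if done:
            break
    return beta, phi

# Simulated linear trend with AR(1) noise (phi = 0.7), for illustration only.
rng = np.random.default_rng(3)
n, true_phi = 400, 0.7
z = rng.normal(size=n)
w = np.empty(n)
w[0] = z[0] / np.sqrt(1.0 - true_phi ** 2)
for t in range(1, n):
    w[t] = true_phi * w[t - 1] + z[t]
X = np.column_stack([np.ones(n), np.arange(1, n + 1)])
Y = X @ np.array([10.0, -0.02]) + w

beta_hat, phi_hat = iterate_regression_ar1(X, Y)
print(np.round(beta_hat, 4), round(float(phi_hat), 3))
```

The scale $\sigma^2$ cancels in the GLS formula, which is why `ar1_gamma` can use unit white-noise variance.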


If {W,} is a causal and invertible ARMA process, then under mild conditions 
on the explanatory variables x,, the maximum likelihood estimates are asymptoti- 
cally multivariate normal (see Fuller, 1976). In addition, the estimated regression 
coefficients are asymptotically independent of the estimated ARMA parameters. 

The large-sample covariance matrix of the ARMA parameter estimators, suitably
normalized, has a complicated form that involves both the regression variables $x_t$ and
the covariance function of $\{W_t\}$. It is therefore convenient to estimate the covariance
matrix as $-H^{-1}$, where $H$ is the Hessian matrix of the observed log-likelihood
evaluated at its maximum.

The OLS, GLS, and maximum likelihood estimators of the regression coefficients
all have the same asymptotic covariance matrix, so in this sense the dependence does
not play a major role. However, the asymptotic covariance of both the OLS and GLS
estimators can be very inaccurate if the appropriate covariance matrix $\Gamma_n$ is not used in
the expressions (6.6.5) and (6.6.8). This point is illustrated in the following examples.


Remark 1. The use of the innovations algorithm for GLS and ML estimation extends
to regression with ARIMA errors (see Example 6.6.3 below) and FARIMA errors
(FARIMA processes are defined in Section 10.5).

Example 6.6.1 The overshort data


The analysis of the overshort data in Example 3.2.8 suggested the model

\[ Y_t = \beta + W_t, \]

where $-\beta$ is interpreted as the daily leakage from the underground storage tank and
$\{W_t\}$ is the MA(1) process

\[ W_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \]

(Here $k = 1$ and $x_{t1} = 1$.) The OLS estimate of $\beta$ is simply the sample mean
$\hat\beta_{\mathrm{OLS}} = \bar Y_n = -4.035$. Under the assumption that $\{W_t\}$ is iid noise, the estimated
variance of the OLS estimator of $\beta$ is $\hat\gamma_Y(0)/57 = 59.92$. However, since this estimate
of the variance fails to take dependence into account, it is not reliable.

To find maximum Gaussian likelihood estimates of $\beta$ and the parameters of $\{W_t\}$
using ITSM, open the file OSHORTS.TSM, select the option Regression>Specify
and check the box marked Include intercept term only. Then press the blue
GLS button and you will see the estimated value of $\beta$. (This is in fact the same
as the OLS estimator since the default model in ITSM is WN(0,1).) Then select
Model>Estimation>Autofit and press Start. The autofit option selects the minimum
AICC model for the residuals,

\[ W_t = Z_t - .818Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 2041), \tag{6.6.13} \]

and displays the estimated MA coefficient $\hat\theta_1 = -.818$ and the corresponding GLS
estimate $\hat\beta_{\mathrm{GLS}} = -4.745$, with a standard error of 1.188, in the Regression estimates
window. (If we reestimate the variance of the OLS estimator, using (6.6.5)
with $\Gamma_{57}$ computed from the model (6.6.13), we obtain the value 2.214, a drastic reduction
from the value 59.92 obtained when dependence is ignored. For a positively
correlated time series, ignoring the dependence would lead to underestimation of the
variance.)

Pressing the blue MLE button will reestimate the MA parameters using the resid- 
uals from the updated regression and at the same time reestimate the regression 
coefficient, printing the new parameters in the Regression estimates window. 
After this operation has been repeated several times, the parameters will stabilize, as 
shown in Table 6.2. Estimated 95% confidence bounds for $\beta$ using the GLS estimate
are $-4.75 \pm 1.96(1.408)^{1/2} = (-7.07, -2.43)$, strongly suggesting that the storage
tank has a leak. Such a conclusion would not have been reached without taking into
account the dependence in the data.
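The variance of the OLS estimator under the fitted MA(1) model can be recomputed from (6.6.5). The sketch below uses only the values quoted in (6.6.13) and the sample size $n = 57$; it is an independent check rather than the ITSM calculation, and the variable names are hypothetical. The GLS variance from (6.6.8) under the same model is computed alongside for comparison.

```python
import numpy as np

n, theta, sigma2 = 57, -0.818, 2041.0     # values quoted in (6.6.13)

# Covariance matrix Gamma_57 of the MA(1) errors: gamma(0) on the diagonal,
# gamma(1) on the first off-diagonals, zero elsewhere.
gamma0 = sigma2 * (1.0 + theta ** 2)
gamma1 = sigma2 * theta
Gamma = gamma0 * np.eye(n) + gamma1 * (np.eye(n, k=1) + np.eye(n, k=-1))

X = np.ones((n, 1))                        # intercept-only design matrix
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = (XtX_inv @ X.T @ Gamma @ X @ XtX_inv).item()            # (6.6.5)
var_gls = np.linalg.inv(X.T @ np.linalg.solve(Gamma, X)).item()   # (6.6.8)
print(round(var_ols, 3), round(var_gls, 3))
```

The OLS variance obtained this way is close to the value 2.214 quoted in the example, and far below the figure 59.92 obtained when the negative correlation at lag one is ignored.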


Table 6.2 Estimates of $\beta$ and $\theta_1$ for the
overshort data of Example 6.6.1.

Iteration i    $\hat\theta^{(i)}$    $\hat\beta^{(i)}$
0              0                     -4.035
1              -.818                 -4.745
2              -.848                 -4.780
3              -.848                 -4.780

Example 6.6.2 The lake data

In Examples 5.2.4 and 5.5.2 we found maximum likelihood ARMA(1,1) and AR(2)
models for the mean-corrected lake data. Now let us consider fitting a linear trend
to the data with AR(2) noise. The choice of an AR(2) model was suggested by an
analysis of the residuals obtained after removing a linear trend from the data using
OLS. Our model now takes the form

\[ Y_t = \beta_0 + \beta_1 t + W_t, \]

where $\{W_t\}$ is the AR(2) process satisfying

\[ W_t = \phi_1 W_{t-1} + \phi_2 W_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \]


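From Example 1.3.5, we find that the OLS estimate of $\beta$ is $\hat\beta_{\mathrm{OLS}} = (10.202, -.0242)'$.
If we ignore the correlation structure of the noise, the estimated covariance matrix $\Gamma_n$
of $W$ is $\hat\gamma(0)I$ (where $I$ is the identity matrix). The corresponding estimated covariance
matrix of $\hat\beta_{\mathrm{OLS}}$ is (from (6.6.5))

\[ \hat\gamma_Y(0)(X'X)^{-1} = \hat\gamma_Y(0)\begin{bmatrix} n & \sum_{t=1}^{n} t \\ \sum_{t=1}^{n} t & \sum_{t=1}^{n} t^2 \end{bmatrix}^{-1} = \begin{bmatrix} .07203 & -.00110 \\ -.00110 & .00002 \end{bmatrix}. \tag{6.6.14} \]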


However, the estimated model for the noise process, found by fitting an AR(2) model
to the residuals $Y_t - \hat\beta_{\mathrm{OLS}}'x_t$, is

\[ W_t = 1.008W_{t-1} - .295W_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, .4571). \]

Assuming that this is the true model for $\{W_t\}$, the GLS estimate is found to be
$(10.091, -.0216)'$, in close agreement with the OLS estimate. The estimated covariance
matrices for the OLS and GLS estimates are given by

\[ \mathrm{Cov}(\hat\beta_{\mathrm{OLS}}) = \begin{bmatrix} .22177 & -.00335 \\ -.00335 & .00007 \end{bmatrix} \]

and

\[ \mathrm{Cov}(\hat\beta_{\mathrm{GLS}}) = \begin{bmatrix} .21392 & -.00321 \\ -.00321 & .00006 \end{bmatrix}. \]

Notice how the estimated variances of the OLS and GLS estimators are nearly three
times the magnitude of the corresponding variance estimates of the OLS calculated
under the independence assumption (see (6.6.8)). Estimated 95% confidence bounds
for the slope $\beta_1$ using the GLS estimate are $-.0216 \pm 1.96(.00006)^{1/2} = -.0216 \pm .0152$,
indicating a significant decreasing trend in the level of Lake Huron during the
years 1875–1972.

The iterative procedure described above was used to produce maximum likelihood
estimates of the parameters. The calculations using ITSM are analogous to
those in Example 6.6.1. The results from each iteration are summarized in Table 6.3.
As in Example 6.6.1, the convergence of the estimates is very rapid.

Table 6.3 Estimates of $\beta$ and $\phi$ for the lake data
after 3 iterations.

Iteration i    $\hat\phi_1^{(i)}$    $\hat\phi_2^{(i)}$    $\hat\beta_0^{(i)}$    $\hat\beta_1^{(i)}$
0              0                     0                     10.20                  -.0242
1              1.008                 -.295                 10.09                  -.0216
2              1.005                 -.291                 10.09                  -.0216

Example 6.6.3 Seat-belt legislation; SBL.TSM

Figure 6.18 shows the numbers of monthly deaths and serious injuries $Y_t$, $t = 1, \ldots, 120$,
on UK roads for 10 years beginning in January 1975. They are filed
as SBL.TSM. Seat-belt legislation was introduced in February 1983 in the hope of
reducing the mean number of monthly "deaths and serious injuries" (from $t = 99$
onwards). In order to study whether or not there was a drop in mean from that time
onwards, we consider the regression,

\[ Y_t = a + bf(t) + W_t, \quad t = 1, \ldots, 120, \tag{6.6.15} \]

where $f(t) = 0$ for $1 \le t \le 98$, and $f(t) = 1$ for $t \ge 99$. The seat-belt legislation
will be considered effective if the estimated value of the regression coefficient $b$
is significantly negative. This problem also falls under the heading of intervention
analysis (see Section 10.2).

Figure 6-18 Monthly deaths and serious injuries $\{Y_t\}$ on UK roads, Jan. '75 to Dec. '84.

OLS regression based on the model (6.6.15) suggests that the error sequence
$\{W_t\}$ is highly correlated with a strong seasonal component of period 12. (To do the
regression using ITSM proceed as follows. Open the file SBL.TSM, select Regression>Specify,
check only Include intercept term and Include auxiliary
variables, press the Browse button, and select the file SBLIN.TSM, which contains
the function $f(t)$ of (6.6.15), and enter 1 for the number of columns. Then select
Regression>Estimation>Generalized LS. The estimates of the coefficients $a$ and
$b$ are displayed in the Regression estimates window, and the data become the
estimates of the residuals $\{W_t\}$.) The graphs of the data and sample ACF clearly suggest
a strong seasonal component with period 12. In order to transform the model
(6.6.15) into one with stationary residuals, we therefore consider the differenced data
$X_t = Y_t - Y_{t-12}$, which satisfy


X, = bg, + N,, t = 13,..., 120, (6.6.16) 


where g, = 1 for 99 < t < 110, g, = 0 otherwise, and {N, = W, — W,—ı2} is a 
stationary sequence to be represented by a suitably chosen ARMA model. The series 
{X,} is contained in the file SBLD.TSM, and the function g, is contained in the file 
SBLDIN.TSM. 
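
The passage from the step function of (6.6.15) to the pulse $g_t$ of (6.6.16) can be checked mechanically, since $g_t = f(t) - f(t-12)$. The following is a minimal Python sketch; the helper names `step_indicator` and `lag_difference` are illustrative assumptions and are not part of ITSM.

```python
# Illustrative sketch only: the helper names are hypothetical, not ITSM functions.

def step_indicator(t, change_point=99):
    """f(t) of (6.6.15): 0 for t <= 98 and 1 for t >= 99."""
    return 1 if t >= change_point else 0


def lag_difference(series, lag=12):
    """Return the lag-differenced values y_t - y_{t-lag}; the first `lag` points are lost."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


times = list(range(1, 121))                # t = 1, ..., 120
f = [step_indicator(t) for t in times]
g = lag_difference(f, lag=12)              # g[0] corresponds to t = 13
g_by_time = dict(zip(times[12:], g))

# Times at which the differenced regressor equals 1.
ones = [t for t, value in g_by_time.items() if value == 1]
print(ones[0], ones[-1], len(ones))        # prints: 99 110 12
```

The same `lag_difference` helper applied to the observations $Y_t$ would produce the series $\{X_t\}$, so the regression in (6.6.16) uses 108 data points rather than 120.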

The next step is to perform ordinary least squares regression of $X_t$ on $g_t$ following
steps analogous to those of the previous paragraph (but this time checking only the
box marked Include auxiliary variables in the Regression Trend Function
dialog box) and again using the option Regression>Estimation>Generalized LS
or pressing the blue GLS button. The model

\[
X_t = -346.92g_t + N_t, \tag{6.6.17}
\]


is then displayed in the Regression estimates window together with the assumed
noise model (white noise in this case). Inspection of the sample ACF of the residuals
suggests an MA(13) or AR(13) model for $\{N_t\}$. Fitting AR and MA models of order
up to 13 (with no mean-correction) using the option Model>Estimation>Autofit
gives an MA(12) model as the minimum AICC fit for the residuals. Once this model
has been fitted, the model in the Regression estimates window is automatically
updated to

\[
X_t = -328.45g_t + N_t, \tag{6.6.18}
\]

with the fitted MA(12) model for the residuals also displayed. After several iterations
(each iteration is performed by pressing the MLE button) we arrive at the model

\[
X_t = -328.45g_t + N_t, \tag{6.6.19}
\]

with

\[
\begin{aligned}
N_t = Z_t &+ .219Z_{t-1} + .098Z_{t-2} + .031Z_{t-3} + .064Z_{t-4} + .069Z_{t-5} + .111Z_{t-6} \\
&+ .081Z_{t-7} + .057Z_{t-8} + .092Z_{t-9} - .028Z_{t-10} + .183Z_{t-11} - .627Z_{t-12},
\end{aligned}
\]

where $\{Z_t\} \sim \mathrm{WN}(0, 12{,}581)$. The estimated standard deviation of the regression
coefficient estimator is 49.41, so the estimated coefficient, $-328.45$, is very significantly
negative, indicating the effectiveness of the legislation. The differenced data
are shown in Figure 6.19 with the fitted regression function.

Figure 6-19  The differenced deaths and serious injuries on UK roads $\{X_t = Y_t - Y_{t-12}\}$, showing the fitted GLS regression line.


Problems


6.1. Suppose that $\{X_t\}$ is an ARIMA$(p, d, q)$ process satisfying the difference equations

\[
\phi(B)(1 - B)^d X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).
\]

Show that these difference equations are also satisfied by the process $W_t =
X_t + A_0 + A_1 t + \cdots + A_{d-1}t^{d-1}$, where $A_0, \ldots, A_{d-1}$ are arbitrary random
variables.


6.2. Verify the representation given in (6.3.4).


6.3. Test the data in Example 6.3.1 for the presence of a unit root in an AR(2) model
using the augmented Dickey-Fuller test.


6.4. Apply the augmented Dickey-Fuller test to the levels of Lake Huron data
(LAKE.TSM). Perform two analyses assuming AR(1) and AR(2) models.


6.5. If $\{Y_t\}$ is a causal ARMA process (with zero mean) and if $X_0$ is a random variable
with finite second moment such that $X_0$ is uncorrelated with $Y_t$ for each $t =
1, 2, \ldots$, show that the best linear predictor of $Y_{n+1}$ in terms of $1, X_0, Y_1, \ldots, Y_n$
is the same as the best linear predictor of $Y_{n+1}$ in terms of $1, Y_1, \ldots, Y_n$.


6.6. Let $\{X_t\}$ be the ARIMA(2,1,0) process satisfying

\[
(1 - 0.8B + 0.25B^2)\nabla X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1).
\]

a. Determine the forecast function $g(h) = P_n X_{n+h}$ for $h > 0$.
b. Assuming that $n$ is large, compute $\sigma_n^2(h)$ for $h = 1, \ldots, 5$.


6.7. Use a text editor to create a new data set ASHORT.TSM that consists of the
data in AIRPASS.TSM with the last twelve values deleted. Use ITSM to find an
ARIMA model for the logarithms of the data in ASHORT.TSM. Your analysis
should include
a. a logical explanation of the steps taken to find the chosen model,
b. approximate 95% bounds for the components of $\boldsymbol\phi$ and $\boldsymbol\theta$,
c. an examination of the residuals to check for whiteness as described in Section 1.6,
d. a graph of the series ASHORT.TSM showing forecasts of the next 12 values
and 95% prediction bounds for the forecasts,
e. numerical values for the 12-step ahead forecast and the corresponding 95%
prediction bounds,
f. a table of the actual forecast errors, i.e., the true value (deleted from AIRPASS.TSM)
minus the forecast value, for each of the twelve forecasts.
Does the last value of AIRPASS.TSM lie within the corresponding 95% prediction bounds?


6.8. Repeat Problem 6.7, but instead of differencing, apply the classical decomposition
method to the logarithms of the data in ASHORT.TSM by deseasonalizing,
subtracting a quadratic trend, and then finding an appropriate ARMA model
for the residuals. Compare the twelve forecast errors found from this approach
with those found in Problem 6.7.
6.9. Repeat Problem 6.7 for the series BEER.TSM, deleting the last twelve values
to create a file named BSHORT.TSM.


6.10. Repeat Problem 6.8 for the series BEER.TSM and the shortened series
BSHORT.TSM.


6.11. A time series $\{X_t\}$ is differenced at lag 12, then at lag 1 to produce a zero-mean
series $\{Y_t\}$ with the following sample ACF:

\[
\begin{aligned}
\hat\rho(12j) &\approx (.8)^{|j|}, & j &= 0, \pm 1, \pm 2, \ldots, \\
\hat\rho(12j \pm 1) &\approx (.4)(.8)^{|j|}, & j &= 0, \pm 1, \pm 2, \ldots, \\
\hat\rho(h) &\approx 0, & &\text{otherwise,}
\end{aligned}
\]

and $\hat\gamma(0) = 25$.
a. Suggest a SARIMA model for $\{X_t\}$ specifying all parameters.
b. For large $n$, express the one- and twelve-step linear predictors $P_n X_{n+1}$ and
$P_n X_{n+12}$ in terms of $X_t$, $t = -12, -11, \ldots, n$, and $Y_t - \hat Y_t$, $t = 1, \ldots, n$.
c. Find the mean squared errors of the predictors in (b).


6.12. Use ITSM to verify the calculations of Examples 6.6.1, 6.6.2, and 6.6.3.


6.13. The file TUNDRA.TSM contains the average maximum temperature over the
month of February for the years 1895-1993 in an area of the USA whose vegetation
is characterized as tundra.
a. Fit a straight line to the data using OLS. Is the slope of the line significantly
different from zero?
b. Find an appropriate ARMA model for the residuals from the OLS fit in (a).
c. Calculate the MLE estimates of the intercept and the slope of the line and
the ARMA parameters in (a). Is the slope of the line significantly different
from zero?
d. Use your model to forecast the average maximum temperature for the years
1994 to 2004.
Multivariate Time Series


7.1 Examples
7.2 Second-Order Properties of Multivariate Time Series
7.3 Estimation of the Mean and Covariance Function
7.4 Multivariate ARMA Processes
7.5 Best Linear Predictors of Second-Order Random Vectors
7.6 Modeling and Forecasting with Multivariate AR Processes
7.7 Cointegration


Many time series arising in practice are best considered as components of some vector-valued
(multivariate) time series $\{\mathbf{X}_t\}$ having not only serial dependence within each
component series $\{X_{ti}\}$ but also interdependence between the different component
series $\{X_{ti}\}$ and $\{X_{tj}\}$, $i \ne j$. Much of the theory of univariate time series extends in
a natural way to the multivariate case; however, new problems arise. In this chapter
we introduce the basic properties of multivariate series and consider the multivariate
extensions of some of the techniques developed earlier. In Section 7.1 we introduce
two sets of bivariate time series data for which we develop multivariate models later
in the chapter. In Section 7.2 we discuss the basic properties of stationary multivariate
time series, namely, the mean vector $\boldsymbol\mu = E\mathbf{X}_t$ and the covariance matrices
$\Gamma(h) = E(\mathbf{X}_{t+h}\mathbf{X}_t') - \boldsymbol\mu\boldsymbol\mu'$, $h = 0, \pm 1, \pm 2, \ldots$, with reference to some simple examples,
including multivariate white noise. Section 7.3 deals with estimation of $\boldsymbol\mu$ and
$\Gamma(\cdot)$ and the question of testing for serial independence on the basis of observations of
$\mathbf{X}_1, \ldots, \mathbf{X}_n$. In Section 7.4 we introduce multivariate ARMA processes and illustrate
the problem of multivariate model identification with an example of a multivariate
AR(1) process that also has an MA(1) representation. (Such examples do not exist in
the univariate case.) The identification problem can be avoided by confining attention
to multivariate autoregressive (or VAR) models. Forecasting multivariate time series
with known second-order properties is discussed in Section 7.5, and in Section 7.6
we consider the modeling and forecasting of multivariate time series using the multivariate
Yule-Walker equations and Whittle's generalization of the Durbin-Levinson
algorithm. Section 7.7 contains a brief introduction to the notion of cointegrated time
series.


7.1 Examples


In this section we introduce two examples of bivariate time series. A bivariate time
series is a series of two-dimensional vectors $(X_{t1}, X_{t2})'$ observed at times $t$ (usually
$t = 1, 2, 3, \ldots$). The two component series $\{X_{t1}\}$ and $\{X_{t2}\}$ could be studied independently
as univariate time series, each characterized, from a second-order point
of view, by its own mean and autocovariance function. Such an approach, however,
fails to take into account possible dependence between the two component series, and
such cross-dependence may be of great importance, for example in predicting future
values of the two component series.

We therefore consider the series of random vectors $\mathbf{X}_t = (X_{t1}, X_{t2})'$ and define
the mean vector

\[
\boldsymbol\mu_t := E\mathbf{X}_t = \begin{bmatrix} EX_{t1} \\ EX_{t2} \end{bmatrix}
\]

and covariance matrices

\[
\Gamma(t+h, t) := \mathrm{Cov}(\mathbf{X}_{t+h}, \mathbf{X}_t) =
\begin{bmatrix}
\mathrm{cov}(X_{t+h,1}, X_{t1}) & \mathrm{cov}(X_{t+h,1}, X_{t2}) \\
\mathrm{cov}(X_{t+h,2}, X_{t1}) & \mathrm{cov}(X_{t+h,2}, X_{t2})
\end{bmatrix}.
\]

The bivariate series $\{\mathbf{X}_t\}$ is said to be (weakly) stationary if the moments $\boldsymbol\mu_t$ and
$\Gamma(t+h, t)$ are both independent of $t$, in which case we use the notation

\[
\boldsymbol\mu = E\mathbf{X}_t = \begin{bmatrix} EX_{t1} \\ EX_{t2} \end{bmatrix}
\]

and

\[
\Gamma(h) = \mathrm{Cov}(\mathbf{X}_{t+h}, \mathbf{X}_t) =
\begin{bmatrix} \gamma_{11}(h) & \gamma_{12}(h) \\ \gamma_{21}(h) & \gamma_{22}(h) \end{bmatrix}.
\]

The diagonal elements are the autocovariance functions of the univariate series $\{X_{t1}\}$
and $\{X_{t2}\}$ as defined in Chapter 2, while the off-diagonal elements are the covariances
between $X_{t+h,i}$ and $X_{tj}$, $i \ne j$. Notice that $\gamma_{12}(h) = \gamma_{21}(-h)$.
A natural estimator of the mean vector $\boldsymbol\mu$ in terms of the observations $\mathbf{X}_1, \ldots, \mathbf{X}_n$
is the vector of sample means

\[
\bar{\mathbf{X}}_n = \frac{1}{n}\sum_{t=1}^{n} \mathbf{X}_t,
\]

and a natural estimator of $\Gamma(h)$ is

\[
\hat\Gamma(h) =
\begin{cases}
n^{-1}\sum_{t=1}^{n-h} (\mathbf{X}_{t+h} - \bar{\mathbf{X}}_n)(\mathbf{X}_t - \bar{\mathbf{X}}_n)' & \text{for } 0 \le h \le n-1, \\
\hat\Gamma(-h)' & \text{for } -n+1 \le h < 0.
\end{cases}
\]

The correlation $\rho_{ij}(h)$ between $X_{t+h,i}$ and $X_{t,j}$ is estimated by

\[
\hat\rho_{ij}(h) = \hat\gamma_{ij}(h)\big(\hat\gamma_{ii}(0)\hat\gamma_{jj}(0)\big)^{-1/2}.
\]

If $i = j$, then $\hat\rho_{ij}$ reduces to the sample autocorrelation function of the $i$th series.
These estimators will be discussed in more detail in Section 7.2.


Example 7.1.1  Dow Jones and All Ordinaries Indices; DJAO2.TSM

Figure 7-1  The Dow Jones Index (top) and Australian All Ordinaries Index (bottom) at closing on 251 trading days ending August 26th, 1994.


Figure 7.1 shows the closing values $D_0, \ldots, D_{250}$ of the Dow Jones Index of stocks on
the New York Stock Exchange and the closing values $A_0, \ldots, A_{250}$ of the Australian
All Ordinaries Index of Share Prices, recorded at the termination of trading on 251
successive trading days up to August 26th, 1994. (Because of the time difference 
between Sydney and New York, the markets do not close simultaneously in both 
places; however, in Sydney the closing price of the Dow Jones index for the previous 
day is known before the opening of the market on any trading day.) The efficient 
market hypothesis suggests that these processes should resemble random walks with 
uncorrelated increments. In order to model the data as a stationary bivariate time 
series we first reexpress them as percentage relative price changes or percentage 
Figure 7-2  The sample ACF $\hat\rho_{11}$ of the observed values of $\{X_{t1}\}$ in Example 7.1.1, showing the bounds $\pm 1.96n^{-1/2}$.
The estimators $\hat\rho_{11}(h)$ and $\hat\rho_{22}(h)$ of the autocorrelations of the two univariate series
are shown in Figures 7.2 and 7.3. They are not significantly different from zero.

To compute the sample cross-correlations $\hat\rho_{12}(h)$ and $\hat\rho_{21}(h)$ using ITSM, select
File>Project>Open>Multivariate. Then click OK and double-click on the file
name DJAOPC2.TSM. You will see a dialog box in which Number of columns
should be set to 2 (the number of components of the observation vectors). Then click
OK, and the graphs of the two component series will appear. To see the correlations,
press the middle yellow button at the top of the ITSM window. The correlation
functions are plotted as a $2 \times 2$ array of graphs with $\hat\rho_{11}(h)$, $\hat\rho_{12}(h)$ in the top row and
$\hat\rho_{21}(h)$, $\hat\rho_{22}(h)$ in the second row. We see from these graphs (shown in Figure 7.4) that
although the autocorrelations $\hat\rho_{ii}(h)$, $i = 1, 2$, are all small, there is a much larger
correlation between $X_{t-1,1}$ and $X_{t,2}$. This indicates the importance of considering
the two series jointly as components of a bivariate time series. It also suggests that
the value of $X_{t-1,1}$, i.e., the Dow Jones return on day $t - 1$, may be of assistance in
predicting the value of $X_{t,2}$, the All Ordinaries return on day $t$. This last observation
is supported by the scatterplot of the points $(x_{t-1,1}, x_{t,2})$, $t = 2, \ldots, 250$, shown in
Figure 7.5.
Figure 7-5  Scatterplot of $(x_{t-1,1}, x_{t,2})$, $t = 2, \ldots, 250$, for the data in Example 7.1.1.


Example 7.1.2  Sales with a leading indicator; LS2.TSM


In this example we consider the sales data $\{Y_{t2}, t = 1, \ldots, 150\}$ with leading indicator
$\{Y_{t1}, t = 1, \ldots, 150\}$ given by Box and Jenkins (1976), p. 537. The two series
are stored in the ITSM data files SALES.TSM and LEAD.TSM, respectively, and in
bivariate format as LS2.TSM. The graphs of the two series and their sample autocorrelation
functions strongly suggest that both series are nonstationary. Application
of the operator $(1 - B)$ yields the two differenced series $\{D_{t1}\}$ and $\{D_{t2}\}$, whose
properties are compatible with those of low-order ARMA processes. Using ITSM,
we find that the models

\[
D_{t1} - .0228 = Z_{t1} - .474Z_{t-1,1}, \quad \{Z_{t1}\} \sim \mathrm{WN}(0, .0779), \tag{7.1.1}
\]

\[
D_{t2} - .838D_{t-1,2} - .0676 = Z_{t2} - .610Z_{t-1,2}, \quad \{Z_{t2}\} \sim \mathrm{WN}(0, 1.754), \tag{7.1.2}
\]

provide good fits to the series $\{D_{t1}\}$ and $\{D_{t2}\}$.

The sample autocorrelations and cross-correlations of $\{D_{t1}\}$ and $\{D_{t2}\}$ are computed
by opening the bivariate ITSM file LS2.TSM (as described in Example 7.1.1).
The option Transform>Difference, with differencing lag equal to 1, generates the
bivariate differenced series $\{(D_{t1}, D_{t2})\}$, and the correlation functions are then obtained
as in Example 7.1.1 by clicking on the middle yellow button at the top of the
ITSM screen. The sample auto- and cross-correlations $\hat\rho_{ij}(h)$, $i, j = 1, 2$, are shown
in Figure 7.6. As we shall see in Section 7.3, care must be taken in interpreting the
cross-correlations without first taking into account the autocorrelations of $\{D_{t1}\}$ and
$\{D_{t2}\}$.
7.2 Second-Order Properties of Multivariate Time Series


Consider $m$ time series $\{X_{ti}, t = 0, \pm 1, \ldots\}$, $i = 1, \ldots, m$, with $EX_{ti}^2 < \infty$ for all
$t$ and $i$. If all the finite-dimensional distributions of the random variables $\{X_{ti}\}$ were
multivariate normal, then the distributional properties of $\{X_{ti}\}$ would be completely
determined by the means

\[
\mu_{ti} := EX_{ti} \tag{7.2.1}
\]

and the covariances

\[
\gamma_{ij}(t+h, t) := E[(X_{t+h,i} - \mu_{t+h,i})(X_{tj} - \mu_{tj})]. \tag{7.2.2}
\]

Even when the observations $\{X_{ti}\}$ do not have joint normal distributions, the quantities
$\mu_{ti}$ and $\gamma_{ij}(t+h, t)$ specify the second-order properties, the covariances providing us
with a measure of the dependence, not only between observations in the same series,
but also between the observations in different series.

It is more convenient in dealing with $m$ interrelated series to use vector notation.
Thus we define

\[
\mathbf{X}_t := \begin{bmatrix} X_{t1} \\ \vdots \\ X_{tm} \end{bmatrix}, \quad t = 0, \pm 1, \ldots. \tag{7.2.3}
\]
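The second-order properties of the multivariate time series $\{\mathbf{X}_t\}$ are then specified by
the mean vectors

\[
\boldsymbol\mu_t := E\mathbf{X}_t = \begin{bmatrix} \mu_{t1} \\ \vdots \\ \mu_{tm} \end{bmatrix} \tag{7.2.4}
\]

and covariance matrices

\[
\Gamma(t+h, t) :=
\begin{bmatrix}
\gamma_{11}(t+h, t) & \cdots & \gamma_{1m}(t+h, t) \\
\vdots & \ddots & \vdots \\
\gamma_{m1}(t+h, t) & \cdots & \gamma_{mm}(t+h, t)
\end{bmatrix}, \tag{7.2.5}
\]

where

\[
\gamma_{ij}(t+h, t) := \mathrm{Cov}(X_{t+h,i}, X_{t,j}).
\]

Remark 1. The matrix $\Gamma(t+h, t)$ can also be expressed as

\[
\Gamma(t+h, t) = E[(\mathbf{X}_{t+h} - \boldsymbol\mu_{t+h})(\mathbf{X}_t - \boldsymbol\mu_t)'],
\]

where as usual, the expected value of a random matrix $A$ is the matrix whose components
are the expected values of the components of $A$.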


As in the univariate case, a particularly important role is played by the class of 
multivariate stationary time series, defined as follows. 


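Definition 7.2.1  The $m$-variate series $\{\mathbf{X}_t\}$ is (weakly) stationary if
(i) $\boldsymbol\mu_X(t)$ is independent of $t$
and
(ii) $\Gamma_X(t+h, t)$ is independent of $t$ for each $h$.


For a stationary time series we shall use the notation

\[
\boldsymbol\mu := E\mathbf{X}_t = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_m \end{bmatrix} \tag{7.2.6}
\]

and

\[
\Gamma(h) := E[(\mathbf{X}_{t+h} - \boldsymbol\mu)(\mathbf{X}_t - \boldsymbol\mu)'] =
\begin{bmatrix}
\gamma_{11}(h) & \cdots & \gamma_{1m}(h) \\
\vdots & \ddots & \vdots \\
\gamma_{m1}(h) & \cdots & \gamma_{mm}(h)
\end{bmatrix}. \tag{7.2.7}
\]

We shall refer to $\boldsymbol\mu$ as the mean of the series and to $\Gamma(h)$ as the covariance matrix at
lag $h$. Notice that if $\{\mathbf{X}_t\}$ is stationary with covariance matrix function $\Gamma(\cdot)$, then for
each $i$, $\{X_{ti}\}$ is stationary with covariance function $\gamma_{ii}(\cdot)$. The function $\gamma_{ij}(\cdot)$, $i \ne j$,
is called the cross-covariance function of the two series $\{X_{ti}\}$ and $\{X_{tj}\}$. It should be
noted that $\gamma_{ij}(\cdot)$ is not in general the same as $\gamma_{ji}(\cdot)$. The correlation matrix function
$R(\cdot)$ is defined by

\[
R(h) :=
\begin{bmatrix}
\rho_{11}(h) & \cdots & \rho_{1m}(h) \\
\vdots & \ddots & \vdots \\
\rho_{m1}(h) & \cdots & \rho_{mm}(h)
\end{bmatrix}, \tag{7.2.8}
\]

where $\rho_{ij}(h) = \gamma_{ij}(h)/[\gamma_{ii}(0)\gamma_{jj}(0)]^{1/2}$. The function $R(\cdot)$ is the covariance matrix
function of the normalized series obtained by subtracting $\boldsymbol\mu$ from $\mathbf{X}_t$ and then dividing
each component by its standard deviation.


Example 7.2.1

Consider the bivariate stationary process $\{\mathbf{X}_t\}$ defined by

\[
\begin{aligned}
X_{t1} &= Z_t, \\
X_{t2} &= Z_t + 0.75Z_{t-10},
\end{aligned}
\]

where $\{Z_t\} \sim \mathrm{WN}(0, 1)$. Elementary calculations yield $\boldsymbol\mu = \mathbf{0}$,

\[
\Gamma(-10) = \begin{bmatrix} 0 & 0.75 \\ 0 & 0.75 \end{bmatrix}, \quad
\Gamma(0) = \begin{bmatrix} 1 & 1 \\ 1 & 1.5625 \end{bmatrix}, \quad
\Gamma(10) = \begin{bmatrix} 0 & 0 \\ 0.75 & 0.75 \end{bmatrix},
\]

and $\Gamma(j) = 0$ otherwise. The correlation matrix function is given by

\[
R(-10) = \begin{bmatrix} 0 & 0.60 \\ 0 & 0.48 \end{bmatrix}, \quad
R(0) = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}, \quad
R(10) = \begin{bmatrix} 0 & 0 \\ 0.60 & 0.48 \end{bmatrix},
\]

and $R(j) = 0$ otherwise.


Basic Properties of $\Gamma(\cdot)$:
1. $\Gamma(h) = \Gamma'(-h)$,
2. $|\gamma_{ij}(h)| \le [\gamma_{ii}(0)\gamma_{jj}(0)]^{1/2}$, $i, j = 1, \ldots, m$,
3. $\gamma_{ii}(\cdot)$ is an autocovariance function, $i = 1, \ldots, m$, and
4. $\sum_{j,k=1}^{n} \mathbf{a}_j'\Gamma(j-k)\mathbf{a}_k \ge 0$ for all $n \in \{1, 2, \ldots\}$ and $\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^m$.


Proof  The first property follows at once from the definition, the second from the fact that
correlations cannot be greater than one in absolute value, and the third from the
observation that $\gamma_{ii}(\cdot)$ is the autocovariance function of the stationary series $\{X_{ti}, t =
0, \pm 1, \ldots\}$. Property 4 is a statement of the obvious fact that

\[
E\Big(\sum_{j=1}^{n} \mathbf{a}_j'(\mathbf{X}_j - \boldsymbol\mu)\Big)^2 \ge 0. \qquad \square
\]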
Remark 2. The basic properties of the matrices $\Gamma(h)$ are shared also by the corresponding
matrices of correlations $R(h) = [\rho_{ij}(h)]_{i,j=1}^{m}$, which have the additional
property

\[
\rho_{ii}(0) = 1 \quad \text{for all } i.
\]

The correlation $\rho_{ij}(0)$ is the correlation between $X_{ti}$ and $X_{tj}$, which is generally not
equal to 1 if $i \ne j$ (see Example 7.2.1). It is also possible that $|\gamma_{ij}(h)| > |\gamma_{ij}(0)|$ if
$i \ne j$ (see Problem 7.1).
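
The normalization in (7.2.8) is easy to carry out by hand, and a short script makes the properties in Remark 2 concrete. The following minimal Python sketch applies it to the covariance matrices of the process in Example 7.2.1 (as they follow from its definition); the function name `correlation_matrix` is an illustrative assumption.

```python
# Illustrative sketch: `correlation_matrix` is a hypothetical helper name.

def correlation_matrix(gamma_h, gamma_0):
    """R(h) with entries gamma_ij(h) / sqrt(gamma_ii(0) * gamma_jj(0))."""
    m = len(gamma_0)
    sd = [gamma_0[i][i] ** 0.5 for i in range(m)]
    return [[gamma_h[i][j] / (sd[i] * sd[j]) for j in range(m)]
            for i in range(m)]


# Covariance matrices of X_t1 = Z_t, X_t2 = Z_t + 0.75 Z_{t-10}, {Z_t} ~ WN(0, 1).
gamma = {
    -10: [[0.0, 0.75], [0.0, 0.75]],
    0: [[1.0, 1.0], [1.0, 1.5625]],
    10: [[0.0, 0.0], [0.75, 0.75]],
}

r = {h: correlation_matrix(g, gamma[0]) for h, g in gamma.items()}
for h in sorted(r):
    print(h, [[round(value, 2) for value in row] for row in r[h]])
```

The diagonal of `r[0]` equals 1 while its off-diagonal entries equal 0.8, illustrating that $\rho_{ii}(0) = 1$ but $\rho_{ij}(0)$ need not be 1 when $i \ne j$.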


The simplest multivariate time series is multivariate white noise, the definition 
of which is quite analogous to that of univariate white noise. 


Definition 7.2.2  The $m$-variate series $\{\mathbf{Z}_t\}$ is called white noise with mean $\mathbf{0}$ and covariance
matrix $\Sigma$, written

\[
\{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma), \tag{7.2.9}
\]

if $\{\mathbf{Z}_t\}$ is stationary with mean vector $\mathbf{0}$ and covariance matrix function

\[
\Gamma(h) =
\begin{cases}
\Sigma, & \text{if } h = 0, \\
0, & \text{otherwise.}
\end{cases} \tag{7.2.10}
\]


Definition 7.2.3  The $m$-variate series $\{\mathbf{Z}_t\}$ is called iid noise with mean $\mathbf{0}$ and covariance matrix
$\Sigma$, written

\[
\{\mathbf{Z}_t\} \sim \mathrm{iid}(\mathbf{0}, \Sigma), \tag{7.2.11}
\]

if the random vectors $\{\mathbf{Z}_t\}$ are independent and identically distributed with mean
$\mathbf{0}$ and covariance matrix $\Sigma$.


Multivariate white noise $\{\mathbf{Z}_t\}$ is used as a building block from which can be
constructed an enormous variety of multivariate time series. The linear processes are
generated as follows.


Definition 7.2.4  The $m$-variate series $\{\mathbf{X}_t\}$ is a linear process if it has the representation

\[
\mathbf{X}_t = \sum_{j=-\infty}^{\infty} C_j\mathbf{Z}_{t-j}, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma), \tag{7.2.12}
\]

where $\{C_j\}$ is a sequence of $m \times m$ matrices whose components are absolutely
summable.
The linear process (7.2.12) is stationary (Problem 7.2) with mean $\mathbf{0}$ and covariance
function

\[
\Gamma(h) = \sum_{j=-\infty}^{\infty} C_{j+h}\Sigma C_j', \quad h = 0, \pm 1, \ldots. \tag{7.2.13}
\]
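
Formula (7.2.13) can be evaluated directly when only finitely many of the matrices $C_j$ are nonzero. The Python sketch below does this with plain lists and checks it against Example 7.2.1. Writing that example as a bivariate linear process with $C_0$ and $C_{10}$ as shown, driven by WN$(\mathbf{0}, I)$ noise whose second component is never used, is an assumption made here for illustration; the helper names are likewise hypothetical.

```python
# Illustrative sketch: helper names and the two-component noise embedding of
# Example 7.2.1 are assumptions made for this example.

def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def linear_process_cov(coeffs, sigma, h):
    """Gamma(h) = sum_j C_{j+h} Sigma C_j' for finitely many nonzero C_j.

    `coeffs` maps the index j to the matrix C_j; missing indices mean C_j = 0.
    """
    m = len(sigma)
    total = [[0.0] * m for _ in range(m)]
    for j, c_j in coeffs.items():
        c_lead = coeffs.get(j + h)
        if c_lead is None:
            continue
        term = matmul(matmul(c_lead, sigma), transpose(c_j))
        total = [[total[r][c] + term[r][c] for c in range(m)] for r in range(m)]
    return total


# X_t1 = Z_t and X_t2 = Z_t + 0.75 Z_{t-10}, using only the first noise component.
coeffs = {0: [[1.0, 0.0], [1.0, 0.0]], 10: [[0.0, 0.0], [0.75, 0.0]]}
identity = [[1.0, 0.0], [0.0, 1.0]]

for lag in (-10, 0, 10):
    print(lag, linear_process_cov(coeffs, identity, lag))
```

The three printed matrices agree with $\Gamma(-10)$, $\Gamma(0)$, and $\Gamma(10)$ of Example 7.2.1, and any other lag returns the zero matrix.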


An MA($\infty$) process is a linear process with $C_j = 0$ for $j < 0$. Thus $\{\mathbf{X}_t\}$ is an
MA($\infty$) process if and only if there exists a white noise sequence $\{\mathbf{Z}_t\}$ and a sequence
of matrices $C_j$ with absolutely summable components such that

\[
\mathbf{X}_t = \sum_{j=0}^{\infty} C_j\mathbf{Z}_{t-j}.
\]

Multivariate ARMA processes will be discussed in Section 7.4, where it will
be shown in particular that any causal ARMA$(p, q)$ process can be expressed as an
MA($\infty$) process, while any invertible ARMA$(p, q)$ process can be expressed as an
AR($\infty$) process, i.e., a process satisfying equations of the form

\[
\mathbf{X}_t + \sum_{j=1}^{\infty} A_j\mathbf{X}_{t-j} = \mathbf{Z}_t,
\]

in which the matrices $A_j$ have absolutely summable components.


Second-Order Properties in the Frequency Domain 


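Provided that the components of the covariance matrix function $\Gamma(\cdot)$ have the property
$\sum_{h=-\infty}^{\infty} |\gamma_{ij}(h)| < \infty$, $i, j = 1, \ldots, m$, then $\Gamma$ has a matrix-valued spectral density
function

\[
f(\lambda) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} e^{-i\lambda h}\Gamma(h), \quad -\pi \le \lambda \le \pi,
\]

and $\Gamma$ can be expressed in terms of $f$ as

\[
\Gamma(h) = \int_{-\pi}^{\pi} e^{i\lambda h}f(\lambda)\,d\lambda.
\]

The second-order properties of the stationary process $\{\mathbf{X}_t\}$ can therefore be described
equivalently in terms of $f(\cdot)$ rather than $\Gamma(\cdot)$. Similarly, $\{\mathbf{X}_t\}$ has a spectral representation

\[
\mathbf{X}_t = \int_{-\pi}^{\pi} e^{i\lambda t}\,d\mathbf{Z}(\lambda),
\]

where $\{\mathbf{Z}(\lambda), -\pi \le \lambda \le \pi\}$ is a process whose components are complex-valued
processes satisfying

\[
E\big(dZ_j(\lambda)\,d\bar{Z}_k(\mu)\big) =
\begin{cases}
f_{jk}(\lambda)\,d\lambda & \text{if } \lambda = \mu, \\
0 & \text{if } \lambda \ne \mu,
\end{cases}
\]

and $\bar{Z}_k$ denotes the complex conjugate of $Z_k$. We shall not go into the spectral representation
in this book. For details see TSTM.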


7.3 Estimation of the Mean and Covariance Function


As in the univariate case, the estimation of the mean vector and covariances of a
stationary multivariate time series plays an important role in describing and modeling
the dependence structure of the component series. In this section we introduce
estimators, for a stationary $m$-variate time series $\{\mathbf{X}_t\}$, of the components $\mu_j$, $\gamma_{ij}(h)$,
and $\rho_{ij}(h)$ of $\boldsymbol\mu$, $\Gamma(h)$, and $R(h)$, respectively. We also examine the large-sample
properties of these estimators.


7.3.1 Estimation of $\boldsymbol\mu$


A natural unbiased estimator of the mean vector $\boldsymbol\mu$ based on the observations
$\mathbf{X}_1, \ldots, \mathbf{X}_n$ is the vector of sample means

\[
\bar{\mathbf{X}}_n = \frac{1}{n}\sum_{t=1}^{n} \mathbf{X}_t.
\]

The resulting estimate of the mean of the $j$th time series is then the univariate sample
mean $(1/n)\sum_{t=1}^{n} X_{tj}$. If each of the univariate autocovariance functions $\gamma_{ii}(\cdot)$, $i =
1, \ldots, m$, satisfies the conditions of Proposition 2.4.1, then the consistency of the
estimator $\bar{\mathbf{X}}_n$ can be established by applying the proposition to each of the component
time series $\{X_{ti}\}$. This immediately gives the following result.


Proposition 7.3.1  If $\{\mathbf{X}_t\}$ is a stationary multivariate time series with mean $\boldsymbol\mu$ and covariance function
$\Gamma(\cdot)$, then as $n \to \infty$,

\[
E(\bar{\mathbf{X}}_n - \boldsymbol\mu)'(\bar{\mathbf{X}}_n - \boldsymbol\mu) \to 0 \quad \text{if } \gamma_{ii}(n) \to 0, \quad 1 \le i \le m,
\]

and

\[
nE(\bar{\mathbf{X}}_n - \boldsymbol\mu)'(\bar{\mathbf{X}}_n - \boldsymbol\mu) \to \sum_{i=1}^{m}\sum_{h=-\infty}^{\infty} \gamma_{ii}(h) \quad \text{if } \sum_{h=-\infty}^{\infty} |\gamma_{ii}(h)| < \infty, \quad 1 \le i \le m.
\]
Under more restrictive assumptions on the process $\{\mathbf{X}_t\}$ it can also be shown that
$\bar{\mathbf{X}}_n$ is approximately normally distributed for large $n$. Determination of the covariance
matrix of this distribution would allow us to obtain confidence regions for $\boldsymbol\mu$. However,
this is quite complicated, and the following simple approximation is useful in practice.

For each $i$ we construct a confidence interval for $\mu_i$ based on the sample mean $\bar{X}_i$
of the univariate series $X_{1i}, \ldots, X_{ni}$ and combine these to form a confidence region
for $\boldsymbol\mu$. If $f_i(\omega)$ is the spectral density of the $i$th process $\{X_{ti}\}$ and if the sample size $n$ is
large, then we know, under the same conditions as in Section 2.4, that $\sqrt{n}(\bar{X}_i - \mu_i)$
is approximately normally distributed with mean zero and variance

\[
2\pi f_i(0) = \sum_{k=-\infty}^{\infty} \gamma_{ii}(k).
\]

It can also be shown (see, e.g., Anderson, 1971) that

\[
2\pi\hat{f}_i(0) := \sum_{|h| \le r}\Big(1 - \frac{|h|}{r}\Big)\hat\gamma_{ii}(h)
\]

is a consistent estimator of $2\pi f_i(0)$, provided that $r = r_n$ is a sequence of numbers
depending on $n$ in such a way that $r_n \to \infty$ and $r_n/n \to 0$ as $n \to \infty$. Thus if $\bar{X}_i$
denotes the sample mean of the $i$th process and $\Phi_\alpha$ is the $\alpha$-quantile of the standard
normal distribution, then the bounds

\[
\bar{X}_i \pm \Phi_{1-\alpha/2}\big(2\pi\hat{f}_i(0)/n\big)^{1/2}
\]

are asymptotic $(1 - \alpha)$ confidence bounds for $\mu_i$. Hence

\[
P\Big(|\mu_i - \bar{X}_i| \le \Phi_{1-\alpha/2}\big(2\pi\hat{f}_i(0)/n\big)^{1/2},\ i = 1, \ldots, m\Big)
\ge 1 - \sum_{i=1}^{m} P\Big(|\mu_i - \bar{X}_i| > \Phi_{1-\alpha/2}\big(2\pi\hat{f}_i(0)/n\big)^{1/2}\Big),
\]

where the right-hand side converges to $1 - m\alpha$ as $n \to \infty$. Consequently, as $n \to \infty$,
the set of $m$-dimensional vectors bounded by

\[
\Big\{x_i = \bar{X}_i \pm \Phi_{1-(\alpha/(2m))}\big(2\pi\hat{f}_i(0)/n\big)^{1/2},\ i = 1, \ldots, m\Big\} \tag{7.3.1}
\]

has a confidence coefficient that converges to a value greater than or equal to $1 - \alpha$
(and substantially greater if $m$ is large). Nevertheless, the region defined by (7.3.1) is
easy to determine and is of reasonable size, provided that $m$ is not too large.
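
The region (7.3.1) needs only the sample means, the estimates $2\pi\hat{f}_i(0)$, and one normal quantile. The following minimal Python sketch uses the standard-library `statistics.NormalDist` for the quantile; the function names and the toy numbers are illustrative assumptions, not values from the text.

```python
from statistics import NormalDist

# Illustrative sketch: function names and inputs are hypothetical.

def bartlett_long_run_variance(acvf, r):
    """Sum over |h| <= r of (1 - |h|/r) * gamma_hat(h).

    `acvf` holds gamma_hat(0), ..., gamma_hat(r) for one component series;
    the symmetry gamma_hat(-h) = gamma_hat(h) is used for negative lags.
    """
    total = acvf[0]
    for h in range(1, r + 1):
        total += 2.0 * (1.0 - h / r) * acvf[h]
    return total


def simultaneous_bounds(means, long_run_vars, n, alpha=0.05):
    """Bounds of the form (7.3.1), using the 1 - alpha/(2m) normal quantile."""
    m = len(means)
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * m))
    return [(xbar - z * (v / n) ** 0.5, xbar + z * (v / n) ** 0.5)
            for xbar, v in zip(means, long_run_vars)]


# Toy inputs: two component series, each with gamma_hat(0..2) = 1, 0.5, 0.25.
v = bartlett_long_run_variance([1.0, 0.5, 0.25], r=2)
print(v)                                            # prints: 1.5
print(simultaneous_bounds([0.0, 1.0], [v, v], n=100))
```

With $m = 2$ and $\alpha = .05$ the quantile used is $\Phi_{.9875}$, which is somewhat larger than the value 1.96 that would be used for a single series; this widening is what keeps the joint coverage at or above $1 - \alpha$.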


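7.3.2 Estimation of $\Gamma(h)$


As in the univariate case, a natural estimator of the covariance $\Gamma(h) = E[(\mathbf{X}_{t+h} -
\boldsymbol\mu)(\mathbf{X}_t - \boldsymbol\mu)']$ is

\[
\hat\Gamma(h) =
\begin{cases}
n^{-1}\sum_{t=1}^{n-h} (\mathbf{X}_{t+h} - \bar{\mathbf{X}}_n)(\mathbf{X}_t - \bar{\mathbf{X}}_n)' & \text{for } 0 \le h \le n-1, \\
\hat\Gamma(-h)' & \text{for } -n+1 \le h < 0.
\end{cases}
\]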
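
The estimator $\hat\Gamma(h)$ translates almost line by line into code. Below is a minimal Python sketch using plain lists; the function names and the five toy observations are illustrative assumptions, and in practice the same quantities are produced by ITSM.

```python
# Illustrative sketch: function names and data are hypothetical.

def sample_mean(xs):
    """Componentwise mean of a list of m-dimensional observations."""
    n, m = len(xs), len(xs[0])
    return [sum(x[i] for x in xs) / n for i in range(m)]


def sample_cov_matrix(xs, h):
    """Gamma_hat(h): divisor n for 0 <= h <= n-1, transpose of Gamma_hat(-h) for h < 0."""
    n, m = len(xs), len(xs[0])
    if abs(h) >= n:
        raise ValueError("lag must satisfy |h| <= n - 1")
    if h < 0:
        g = sample_cov_matrix(xs, -h)
        return [[g[j][i] for j in range(m)] for i in range(m)]
    xbar = sample_mean(xs)
    gamma = [[0.0] * m for _ in range(m)]
    for t in range(n - h):
        lead, base = xs[t + h], xs[t]
        for i in range(m):
            for j in range(m):
                gamma[i][j] += (lead[i] - xbar[i]) * (base[j] - xbar[j])
    return [[value / n for value in row] for row in gamma]


data = [(1.0, 0.0), (2.0, 1.0), (0.0, 2.0), (3.0, 0.0), (1.0, 3.0)]
print(sample_cov_matrix(data, 0))
print(sample_cov_matrix(data, 1))
print(sample_cov_matrix(data, -1))   # transpose of the lag-1 matrix
```

Note that the divisor is $n$ rather than $n - h$, exactly as in the formula above, and that the $(i, j)$ entry at lag $-h$ equals the $(j, i)$ entry at lag $h$.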


Writing $\hat\gamma_{ij}(h)$ for the $(i, j)$-component of $\hat\Gamma(h)$, $i, j = 1, 2, \ldots$, we estimate the
cross-correlations by

\[
\hat\rho_{ij}(h) = \hat\gamma_{ij}(h)\big(\hat\gamma_{ii}(0)\hat\gamma_{jj}(0)\big)^{-1/2}.
\]

If $i = j$, then $\hat\rho_{ij}$ reduces to the sample autocorrelation function of the $i$th series.

Derivation of the large-sample properties of $\hat\gamma_{ij}$ and $\hat\rho_{ij}$ is quite complicated in
general. Here we shall simply note one result that is of particular importance for
testing the independence of two component series. For details of the proof of this and
related results, see TSTM.


Theorem 7.3.1


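Let $\{\mathbf{X}_t\}$ be the bivariate time series whose components are defined by

\[
X_{t1} = \sum_{k=-\infty}^{\infty} \alpha_k Z_{t-k,1}, \quad \{Z_{t1}\} \sim \mathrm{IID}(0, \sigma_1^2),
\]

and

\[
X_{t2} = \sum_{k=-\infty}^{\infty} \beta_k Z_{t-k,2}, \quad \{Z_{t2}\} \sim \mathrm{IID}(0, \sigma_2^2),
\]

where the two sequences $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are independent, $\sum_k |\alpha_k| < \infty$, and
$\sum_k |\beta_k| < \infty$.

Then for all integers $h$ and $k$ with $h \ne k$, the random variables $n^{1/2}\hat\rho_{12}(h)$
and $n^{1/2}\hat\rho_{12}(k)$ are approximately bivariate normal with mean $\mathbf{0}$, variance
$\sum_{j=-\infty}^{\infty} \rho_{11}(j)\rho_{22}(j)$, and covariance $\sum_{j=-\infty}^{\infty} \rho_{11}(j)\rho_{22}(j + k - h)$, for $n$ large.

[For a related result that does not require the independence of the two series $\{X_{t1}\}$
and $\{X_{t2}\}$ see Theorem 7.3.2 below.]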


Theorem 7.3.1 is useful in testing for correlation between two time series. If one
of the two processes in the theorem is white noise, then it follows at once from the
theorem that $\hat\rho_{12}(h)$ is approximately normally distributed with mean 0 and variance
$1/n$, in which case it is straightforward to test the hypothesis that $\rho_{12}(h) = 0$. However,
if neither process is white noise, then a value of $\hat\rho_{12}(h)$ that is large relative to $n^{-1/2}$ does
not necessarily indicate that $\rho_{12}(h)$ is different from zero. For example, suppose that
$\{X_{t1}\}$ and $\{X_{t2}\}$ are two independent AR(1) processes with $\rho_{11}(h) = \rho_{22}(h) = .8^{|h|}$.
Then the large-sample variance of $\hat\rho_{12}(h)$ is $n^{-1}\big(1 + 2\sum_{k=1}^{\infty} (.64)^k\big) = 4.556n^{-1}$. It
would therefore not be surprising to observe a value of $\hat\rho_{12}(h)$ as large as $3n^{-1/2}$
even though $\{X_{t1}\}$ and $\{X_{t2}\}$ are independent. If on the other hand, $\rho_{11}(h) = .8^{|h|}$
and $\rho_{22}(h) = (-.8)^{|h|}$, then the large-sample variance of $\hat\rho_{12}(h)$ is $.2195n^{-1}$, and an
observed value of $3n^{-1/2}$ for $\hat\rho_{12}(h)$ would be very unlikely.
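
The two variance factors quoted above follow from summing $\rho_{11}(j)\rho_{22}(j)$ over all lags, which for AR(1) autocorrelations is a geometric series. A minimal Python sketch that reproduces them numerically (the function name `ccf_variance_factor` is an illustrative assumption):

```python
# Illustrative sketch: `ccf_variance_factor` is a hypothetical helper name.

def ccf_variance_factor(phi1, phi2, terms=10_000):
    """Approximate sum over all j of rho_11(j) * rho_22(j) when rho_ii(j) = phi_i ** |j|.

    The result multiplied by 1/n is the large-sample variance of the sample
    cross-correlation of two independent AR(1) series (Theorem 7.3.1).
    """
    total = 1.0                                   # j = 0 term
    for j in range(1, terms):
        total += 2.0 * (phi1 * phi2) ** j         # terms j and -j are equal
    return total


same_sign = ccf_variance_factor(0.8, 0.8)
opposite_sign = ccf_variance_factor(0.8, -0.8)
print(round(same_sign, 3), round(opposite_sign, 4))   # prints: 4.556 0.2195

# An observed value of 3 / sqrt(n), expressed in large-sample standard deviations.
print(round(3 / same_sign ** 0.5, 2), round(3 / opposite_sign ** 0.5, 2))
```

In the first case the value $3n^{-1/2}$ is only about 1.4 standard deviations from zero, while in the second it is more than 6, which is the contrast drawn in the paragraph above.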
7.3.3 Testing for Independence of Two Stationary Time Series


Since by Theorem 7.3.1 the large-sample distribution of $\hat\rho_{12}(h)$ depends on both $\rho_{11}(\cdot)$
and $\rho_{22}(\cdot)$, any test for independence of the two component series cannot be based
solely on estimated values of $\rho_{12}(h)$, $h = 0, \pm 1, \ldots$, without taking into account the
nature of the two component series.

This difficulty can be circumvented by "prewhitening" the two series before
computing the cross-correlations $\hat\rho_{12}(h)$, i.e., by transforming the two series to white
noise by application of suitable filters. If $\{X_{t1}\}$ and $\{X_{t2}\}$ are invertible ARMA$(p, q)$
processes, this can be achieved by the transformations

\[
Z_{ti} = \sum_{j=0}^{\infty} \pi_j^{(i)}X_{t-j,i},
\]

where $\sum_{j=0}^{\infty} \pi_j^{(i)}z^j = \phi^{(i)}(z)/\theta^{(i)}(z)$ and $\phi^{(i)}$, $\theta^{(i)}$ are the autoregressive and moving-average
polynomials of the $i$th series, $i = 1, 2$.

Since in practice the true model is nearly always unknown and since the data $X_{tj}$,
$t \le 0$, are not available, it is convenient to replace the sequences $\{Z_{ti}\}$ by the residuals
$\{\hat{W}_{ti}\}$ after fitting a maximum likelihood ARMA model to each of the component
series (see (5.3.1)). If the fitted ARMA models were in fact the true models, the series
$\{\hat{W}_{ti}\}$ would be white noise sequences for $i = 1, 2$.

To test the hypothesis $H_0$ that $\{X_{t1}\}$ and $\{X_{t2}\}$ are independent series, we observe
that under $H_0$, the corresponding two prewhitened series $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are also independent.
Theorem 7.3.1 then implies that the sample cross-correlations $\hat\rho_{12}(h)$, $\hat\rho_{12}(k)$,
$h \ne k$, of $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are for large $n$ approximately independent and normally distributed
with means 0 and variances $n^{-1}$. An approximate test for independence can
therefore be obtained by comparing the values of $|\hat\rho_{12}(h)|$ with $1.96n^{-1/2}$, exactly as
in Section 5.3.2. If we prewhiten only one of the two original series, say $\{X_{t1}\}$, then
under $H_0$ Theorem 7.3.1 implies that the sample cross-correlations $\hat\rho_{12}(h)$, $\hat\rho_{12}(k)$,
$h \ne k$, of $\{Z_{t1}\}$ and $\{X_{t2}\}$ are for large $n$ approximately normal with means 0, variances
$n^{-1}$ and covariance $n^{-1}\rho_{22}(k - h)$, where $\rho_{22}(\cdot)$ is the autocorrelation function
of $\{X_{t2}\}$. Hence, for any fixed $h$, $\hat\rho_{12}(h)$ also falls (under $H_0$) between the bounds
$\pm 1.96n^{-1/2}$ with a probability of approximately 0.95.


Example 7.3.1


The sample correlation functions $\hat\rho_{ij}(\cdot)$, $i, j = 1, 2$, of the bivariate time series
E731A.TSM (of length $n = 200$) are shown in Figure 7.7. Without taking into
account the autocorrelations $\hat\rho_{ii}(\cdot)$, $i = 1, 2$, it is impossible to decide on the basis of
the cross-correlations whether or not the two component processes are independent
of each other. Notice that many of the sample cross-correlations $\hat\rho_{ij}(h)$, $i \ne j$, lie
outside the bounds $\pm 1.96n^{-1/2} = \pm .139$. However, these bounds are relevant only if
at least one of the component series is white noise. Since this is clearly not the case,
a whitening transformation must be applied to at least one of the two component series.
Analysis using ITSM leads to AR(1) models for each. The residuals from these
maximum likelihood models are stored as a bivariate series in the file E731B.TSM,
and their sample correlations, obtained from ITSM, are shown in Figure 7.8. All but
two of the cross-correlations are between the bounds $\pm .139$, suggesting by Theorem
7.3.1 that the two residual series (and hence the two original series) are uncorrelated.
The data for this example were in fact generated as two independent AR(1) series
with $\phi = 0.8$ and $\sigma^2 = 1$.


7.3.4 Bartlett's Formula


In Section 2.4 we gave Bartlett's formula for the large-sample distribution of the
sample autocorrelation vector $\hat{\boldsymbol\rho} = (\hat\rho(1), \ldots, \hat\rho(k))'$ of a univariate time series.
The following theorem gives a large-sample approximation to the covariances of the
sample cross-correlations $\hat\rho_{12}(h)$ and $\hat\rho_{12}(k)$ of the bivariate time series $\{\mathbf{X}_t\}$ under the
assumption that $\{\mathbf{X}_t\}$ is Gaussian. However, it is not assumed (as in Theorem 7.3.1)
that $\{X_{t1}\}$ is independent of $\{X_{t2}\}$.


Bartlett's Formula:
If $\{\mathbf{X}_t\}$ is a bivariate Gaussian time series with covariances satisfying
$\sum_{h=-\infty}^{\infty} |\gamma_{ij}(h)| < \infty$, $i, j = 1, 2$, then

\[
\begin{aligned}
\lim_{n\to\infty} n\,\mathrm{Cov}\big(\hat\rho_{12}(h), \hat\rho_{12}(k)\big) = \sum_{j=-\infty}^{\infty} \Big[\, & \rho_{11}(j)\rho_{22}(j + k - h) + \rho_{12}(j + k)\rho_{21}(j - h) \\
& - \rho_{12}(h)\{\rho_{11}(j)\rho_{12}(j + k) + \rho_{22}(j)\rho_{21}(j - k)\} \\
& - \rho_{12}(k)\{\rho_{11}(j)\rho_{12}(j + h) + \rho_{22}(j)\rho_{21}(j - h)\} \\
& + \rho_{12}(h)\rho_{12}(k)\Big\{\tfrac{1}{2}\rho_{11}^2(j) + \rho_{12}^2(j) + \tfrac{1}{2}\rho_{22}^2(j)\Big\} \Big].
\end{aligned}
\]


Corollary 7.3.1  If $\{\mathbf{X}_t\}$ satisfies the conditions for Bartlett's formula, if either $\{X_{t1}\}$ or $\{X_{t2}\}$ is white
noise, and if

\[
\rho_{12}(h) = 0, \quad h \notin [a, b],
\]

then

\[
\lim_{n\to\infty} n\,\mathrm{Var}\big(\hat\rho_{12}(h)\big) = 1, \quad h \notin [a, b].
\]


Example 7.3.2  Sales with a leading indicator


We consider again the differenced series $\{D_{t1}\}$ and $\{D_{t2}\}$ of Example 7.1.2, for which we found the maximum likelihood models (7.1.1) and (7.1.2) using ITSM. The residuals from the two models (which can be filed by ITSM) are the two “whitened” series $\{\hat W_{t1}\}$ and $\{\hat W_{t2}\}$ with sample variances .0779 and 1.754, respectively. This bivariate series is contained in the file E732.TSM.

Figure 7-7. The sample correlations of the bivariate series E731A.TSM of Example 7.3.1, showing the bounds $\pm 1.96 n^{-1/2}$.

Figure 7-8. The sample correlations of the bivariate series of residuals E731B.TSM, whose components are the residuals from the AR(1) models fitted to each of the component series in E731A.TSM.

Figure 7-9. The sample correlations of the whitened series $\hat W_{t+h,1}$ and $\hat W_{t2}$ of Example 7.3.2, showing the bounds $\pm 1.96 n^{-1/2}$.

The sample auto- and cross-correlations of $\{D_{t1}\}$ and $\{D_{t2}\}$ were shown in Figure 7.6. Without taking into account the autocorrelations, it is not possible to draw any conclusions about the dependence between the two component series from the cross-correlations.

Examination of the sample cross-correlation function of the whitened series $\{\hat W_{t1}\}$ and $\{\hat W_{t2}\}$, on the other hand, is much more informative. From Figure 7.9 it is apparent that there is one large sample cross-correlation (between $\hat W_{t+3,2}$ and $\hat W_{t,1}$), while the others are all between $\pm 1.96 n^{-1/2}$.

If $\{\hat W_{t1}\}$ and $\{\hat W_{t2}\}$ are assumed to be jointly Gaussian, Corollary 7.3.1 indicates the compatibility of the cross-correlations with a model for which

$$\rho_{12}(-3) \ne 0$$

and

$$\rho_{12}(h) = 0, \quad h \ne -3.$$

The value $\hat\rho_{12}(-3) = .969$ suggests the model

$$\hat W_{t2} = 4.74\,\hat W_{t-3,1} + N_t, \tag{7.3.2}$$

where the stationary noise $\{N_t\}$ has small variance compared with $\{\hat W_{t2}\}$ and $\{\hat W_{t1}\}$, and the coefficient 4.74 is the square root of the ratio of sample variances of $\{\hat W_{t2}\}$ and $\{\hat W_{t1}\}$. A study of the sample values of $\{\hat W_{t2} - 4.74\,\hat W_{t-3,1}\}$ suggests the model

$$(1 + .345B)N_t = U_t, \quad \{U_t\} \sim \mathrm{WN}(0, .0782) \tag{7.3.3}$$

for $\{N_t\}$. Finally, replacing $\hat W_{t2}$ and $\hat W_{t-3,1}$ in (7.3.2) by $Z_{t2}$ and $Z_{t-3,1}$, respectively, and then using (7.1.1) and (7.1.2) to express $Z_{t2}$ and $Z_{t-3,1}$ in terms of $\{D_{t2}\}$ and $\{D_{t1}\}$, we obtain a model relating $\{D_{t1}\}$, $\{D_{t2}\}$, and $\{U_t\}$, namely,

$$D_{t2} + .0773 = (1 - .610B)(1 - .838B)^{-1}\big[4.74(1 - .474B)^{-1}D_{t-3,1} + (1 + .345B)^{-1}U_t\big].$$

This model should be compared with the one derived later in Section 10.1 by the more systematic technique of transfer function modeling.
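The arithmetic behind the coefficient in (7.3.2) and the claim that $\{N_t\}$ has small variance can be checked directly from the numbers quoted in the example. The sketch below uses only those rounded values; the variance of $N_t$ is computed from the standard AR(1) formula $\sigma^2/(1 - \phi^2)$ applied to (7.3.3), which is an added step, not one spelled out in the text.

```python
import math

var_w1 = 0.0779    # sample variance of the whitened series {W_t1}
var_w2 = 1.754     # sample variance of the whitened series {W_t2}

# Coefficient in (7.3.2): square root of the ratio of sample variances.
scale = math.sqrt(var_w2 / var_w1)

# Variance of the AR(1) noise N_t in (7.3.3): sigma^2 / (1 - phi^2) with phi = -0.345.
var_noise = 0.0782 / (1.0 - 0.345 ** 2)

print(f"scale = {scale:.3f}")
print(f"Var(N_t) = {var_noise:.4f}, ratio to Var(W_t2) = {var_noise / var_w2:.3f}")
```

The computed scale agrees with the quoted 4.74 up to rounding of the two sample variances, and the implied noise variance is about 5% of the variance of $\{\hat W_{t2}\}$.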


7.4 Multivariate ARMA Processes 


As in the univariate case, we can define an extremely useful class of multivariate stationary processes $\{X_t\}$ by requiring that $\{X_t\}$ should satisfy a set of linear difference equations with constant coefficients. Multivariate white noise $\{Z_t\}$ (see Definition 7.2.2) is a fundamental building block from which these ARMA processes are constructed.


Definition 7.4.1
$\{X_t\}$ is an ARMA($p$, $q$) process if $\{X_t\}$ is stationary and if for every $t$,

$$X_t - \Phi_1 X_{t-1} - \cdots - \Phi_p X_{t-p} = Z_t + \Theta_1 Z_{t-1} + \cdots + \Theta_q Z_{t-q}, \tag{7.4.1}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \Sigma)$. ($\{X_t\}$ is an ARMA($p$, $q$) process with mean $\mu$ if $\{X_t - \mu\}$ is an ARMA($p$, $q$) process.)


Equations (7.4.1) can be written in the more compact form

$$\Phi(B)X_t = \Theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \Sigma), \tag{7.4.2}$$

where $\Phi(z) := I - \Phi_1 z - \cdots - \Phi_p z^p$ and $\Theta(z) := I + \Theta_1 z + \cdots + \Theta_q z^q$ are matrix-valued polynomials, $I$ is the $m \times m$ identity matrix, and $B$ as usual denotes the backward shift operator. (Each component of the matrices $\Phi(z)$, $\Theta(z)$ is a polynomial with real coefficients and degree less than or equal to $p$, $q$, respectively.)


Example 7.4.1  The multivariate AR(1) process

Setting $p = 1$ and $q = 0$ in (7.4.1) gives the defining equations

$$X_t = \Phi X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \Sigma), \tag{7.4.3}$$

for the multivariate AR(1) series $\{X_t\}$. By exactly the same argument as used in Example 2.2.1, we can express $X_t$ as

$$X_t = \sum_{j=0}^{\infty} \Phi^j Z_{t-j}, \tag{7.4.4}$$

provided that all the eigenvalues of $\Phi$ are less than 1 in absolute value, i.e., provided that

$$\det(I - z\Phi) \ne 0 \quad \text{for all } z \in \mathbb{C} \text{ such that } |z| \le 1. \tag{7.4.5}$$

If this condition is satisfied, then the coefficients $\Phi^j$ are absolutely summable, and hence the series in (7.4.4) converges; i.e., each component of the matrix $\sum_{j=0}^{n} \Phi^j Z_{t-j}$ converges (see Remark 1 of Section 2.2). The same argument as in Example 2.2.1 also shows that (7.4.4) is the unique stationary solution of (7.4.3). The condition that all the eigenvalues of $\Phi$ should be less than 1 in absolute value (or equivalently (7.4.5)) is just the multivariate analogue of the condition $|\phi| < 1$ required for the existence of a causal stationary solution of the univariate AR(1) equations (2.2.8).
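For a bivariate AR(1) process the eigenvalue condition can be checked by hand or with a few lines of code. The following is a minimal sketch restricted to $2 \times 2$ matrices; the helper names are hypothetical. The first coefficient matrix is the one quoted later in Example 7.6.1, the second is the matrix of Remark 3, and the third is an invented noncausal case.

```python
import cmath

def eigenvalues_2x2(a):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0

def is_causal_var1(phi):
    """True if all eigenvalues of Phi are less than 1 in absolute value."""
    return all(abs(lam) < 1.0 for lam in eigenvalues_2x2(phi))

phi_dow = [[-0.0148, 0.0357], [0.6589, 0.0998]]   # fitted matrix from Example 7.6.1
phi_nilpotent = [[0.0, 0.5], [0.0, 0.0]]          # matrix of Remark 3
phi_unit_root = [[1.0, 0.0], [0.0, 0.5]]          # hypothetical, eigenvalue 1

for name, phi in [("dow", phi_dow), ("nilpotent", phi_nilpotent), ("unit root", phi_unit_root)]:
    print(name, [round(abs(lam), 4) for lam in eigenvalues_2x2(phi)], is_causal_var1(phi))
```

For larger dimensions a general-purpose eigenvalue routine would be needed; the closed form above is specific to the bivariate case.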


Causality and invertibility of a multivariate ARMA($p$, $q$) process are defined precisely as in Section 3.1, except that the coefficients $\psi_j$, $\pi_j$ in the representations $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ and $Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}$ are replaced by $m \times m$ matrices $\Psi_j$ and $\Pi_j$ whose components are required to be absolutely summable. The following two theorems (proofs of which can be found in TSTM) provide us with criteria for causality and invertibility analogous to those of Section 3.1.


Causality:
An ARMA($p$, $q$) process $\{X_t\}$ is causal, or a causal function of $\{Z_t\}$, if there exist matrices $\{\Psi_j\}$ with absolutely summable components such that

$$X_t = \sum_{j=0}^{\infty} \Psi_j Z_{t-j} \quad \text{for all } t. \tag{7.4.6}$$

Causality is equivalent to the condition

$$\det \Phi(z) \ne 0 \quad \text{for all } z \in \mathbb{C} \text{ such that } |z| \le 1. \tag{7.4.7}$$

The matrices $\Psi_j$ are found recursively from the equations

$$\Psi_j = \Theta_j + \sum_{k=1}^{\infty} \Phi_k \Psi_{j-k}, \quad j = 0, 1, \ldots, \tag{7.4.8}$$

where we define $\Theta_0 = I$, $\Theta_j = 0$ for $j > q$, $\Phi_j = 0$ for $j > p$, and $\Psi_j = 0$ for $j < 0$.


Invertibility:
An ARMA($p$, $q$) process $\{X_t\}$ is invertible if there exist matrices $\{\Pi_j\}$ with absolutely summable components such that

$$Z_t = \sum_{j=0}^{\infty} \Pi_j X_{t-j} \quad \text{for all } t. \tag{7.4.9}$$

Invertibility is equivalent to the condition

$$\det \Theta(z) \ne 0 \quad \text{for all } z \in \mathbb{C} \text{ such that } |z| \le 1. \tag{7.4.10}$$

The matrices $\Pi_j$ are found recursively from the equations

$$\Pi_j = -\Phi_j - \sum_{k=1}^{\infty} \Theta_k \Pi_{j-k}, \quad j = 0, 1, \ldots, \tag{7.4.11}$$

where we define $\Phi_0 = -I$, $\Phi_j = 0$ for $j > p$, $\Theta_j = 0$ for $j > q$, and $\Pi_j = 0$ for $j < 0$.
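Example 7.4.2  For the multivariate AR(1) process defined by (7.4.3), the recursions (7.4.8) give

$$\begin{aligned}
\Psi_0 &= I, \\
\Psi_1 &= \Phi\Psi_0 = \Phi, \\
\Psi_2 &= \Phi\Psi_1 = \Phi^2, \\
&\;\;\vdots \\
\Psi_j &= \Phi\Psi_{j-1} = \Phi^j, \quad j \ge 3,
\end{aligned}$$

as already found in Example 7.4.1.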


Remark 3. For the bivariate AR(1) process (7.4.3) with

$$\Phi = \begin{bmatrix} 0 & 0.5 \\ 0 & 0 \end{bmatrix},$$

it is easy to check that $\Psi_j = \Phi^j = 0$ for $j > 1$ and hence that $\{X_t\}$ has the alternative representation

$$X_t = Z_t + \Phi Z_{t-1}$$

as an MA(1) process. This example shows that it is not always possible to distinguish between multivariate ARMA models of different orders without imposing further restrictions. If, for example, attention is restricted to pure AR processes, the problem does not arise. For detailed accounts of the identification problem for general ARMA($p$, $q$) models see Hannan and Deistler (1988) and Lütkepohl (1993).
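The recursions (7.4.8) translate almost line for line into code. The sketch below is a plain-Python illustration with hypothetical helper names, storing matrices as nested lists; it is applied to the matrix of Remark 3 to confirm that $\Psi_j$ vanishes for $j > 1$.

```python
def mat_mul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_add(a, b):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def psi_matrices(phis, thetas, m, n_terms):
    """Psi_0, ..., Psi_{n_terms-1} from (7.4.8) for AR matrices phis and MA matrices thetas."""
    psis = []
    for j in range(n_terms):
        if j == 0:
            acc = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]  # Theta_0 = I
        elif j <= len(thetas):
            acc = [row[:] for row in thetas[j - 1]]                               # Theta_j
        else:
            acc = [[0.0] * m for _ in range(m)]                                   # Theta_j = 0, j > q
        # Add Phi_k Psi_{j-k}; terms with k > p or k > j vanish by convention.
        for k in range(1, min(j, len(phis)) + 1):
            acc = mat_add(acc, mat_mul(phis[k - 1], psis[j - k]))
        psis.append(acc)
    return psis

phi = [[0.0, 0.5], [0.0, 0.0]]                 # matrix of Remark 3
psis = psi_matrices([phi], [], m=2, n_terms=4)
for j, psi in enumerate(psis):
    print(j, psi)
```

Calling `psi_matrices([], [phi], m=2, n_terms=4)` treats the same matrix as an MA(1) coefficient and returns the identical sequence, which is the identification ambiguity the remark describes.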
7.4.1 The Covariance Matrix Function of a Causal ARMA Process


From (7.2.13) we can express the covariance matrix $\Gamma(h) = E(X_{t+h}X_t')$ of the causal process (7.4.6) as

$$\Gamma(h) = \sum_{j=0}^{\infty} \Psi_{h+j}\,\Sigma\,\Psi_j', \quad h = 0, \pm 1, \ldots, \tag{7.4.12}$$

where the matrices $\Psi_j$ are found from (7.4.8) and $\Psi_j := 0$ for $j < 0$.

The covariance matrices $\Gamma(h)$, $h = 0, \pm 1, \ldots$, can also be found by solving the Yule-Walker equations

$$\Gamma(j) - \sum_{r=1}^{p} \Phi_r \Gamma(j - r) = \sum_{j \le r \le q} \Theta_r\,\Sigma\,\Psi_{r-j}', \quad j = 0, 1, 2, \ldots, \tag{7.4.13}$$

obtained by postmultiplying (7.4.1) by $X_{t-j}'$ and taking expectations. The first $p + 1$ of the equations (7.4.13) can be solved for the components of $\Gamma(0), \ldots, \Gamma(p)$ using the fact that $\Gamma(-h) = \Gamma'(h)$. The remaining equations then give $\Gamma(p + 1), \Gamma(p + 2), \ldots$ recursively. An explicit form of the solution of these equations can be written down by making use of Kronecker products and the vec operator (see e.g., Lütkepohl, 1993).


Remark 4. If $z_0$ is the root of $\det \Phi(z) = 0$ with smallest absolute value, then it can be shown from the recursions (7.4.8) that $\Psi_j / r^j \to 0$ as $j \to \infty$ for all $r$ such that $|z_0|^{-1} < r < 1$. Hence, there is a constant $C$ such that each component of $\Psi_j$ is smaller in absolute value than $Cr^j$. This implies in turn that there is a constant $K$ such that each component of the matrix $\Psi_{h+j}\,\Sigma\,\Psi_j'$ on the right of (7.4.12) is bounded in absolute value by $Kr^{2j}$. Provided that $|z_0|$ is not very close to 1, this means that the series (7.4.12) converges rapidly, and the error incurred in each component by truncating the series after the term with $j = k - 1$ is smaller in absolute value than $\sum_{j=k}^{\infty} Kr^{2j} = Kr^{2k}/(1 - r^2)$.
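The truncated series (7.4.12) is easy to evaluate for a causal multivariate AR(1) process, where $\Psi_j = \Phi^j$. The sketch below uses a hypothetical coefficient matrix and $\Sigma = I$ (both chosen for illustration, not taken from the text) and checks the result against the Yule-Walker relations for $p = 1$, namely $\Gamma(1) = \Phi\Gamma(0)$ and $\Gamma(0) = \Sigma + \Phi\Gamma(0)\Phi'$.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def var1_autocovariance(phi, sigma, h, n_terms=60):
    """Truncated sum (7.4.12) for a causal VAR(1) with Psi_j = Phi^j and lag h >= 0."""
    m = len(phi)
    psi_j = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]   # Phi^0
    psi_hj = psi_j
    for _ in range(h):
        psi_hj = mat_mul(phi, psi_hj)                                        # Phi^h
    gamma = [[0.0] * m for _ in range(m)]
    for _ in range(n_terms):
        gamma = mat_add(gamma, mat_mul(mat_mul(psi_hj, sigma), transpose(psi_j)))
        psi_j = mat_mul(phi, psi_j)
        psi_hj = mat_mul(phi, psi_hj)
    return gamma

phi = [[0.5, 0.1], [-0.2, 0.4]]      # hypothetical; eigenvalues have modulus about 0.47
sigma = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical white noise covariance
gamma0 = var1_autocovariance(phi, sigma, 0)
gamma1 = var1_autocovariance(phi, sigma, 1)
print(gamma0)
print(gamma1)
```

Because the eigenvalues here are well inside the unit circle, far fewer than 60 terms would suffice, in line with the geometric error bound of Remark 4.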


7.5 Best Linear Predictors of Second-Order Random Vectors 


Let $\{X_t = (X_{t1}, \ldots, X_{tm})'\}$ be an $m$-variate time series with means $EX_t = \mu_t$ and covariance function given by the $m \times m$ matrices

$$K(i, j) = E(X_i X_j') - \mu_i \mu_j'.$$

If $Y = (Y_1, \ldots, Y_m)'$ is a random vector with finite second moments and $EY = \mu$, we define

$$P_n(Y) = (P_n Y_1, \ldots, P_n Y_m)', \tag{7.5.1}$$

where $P_n Y_j$ is the best linear predictor of the component $Y_j$ of $Y$ in terms of all of the components of the vectors $X_t$, $t = 1, \ldots, n$, and the constant 1. It follows immediately from the properties of the prediction operator (Section 2.5) that

$$P_n(Y) = \mu + A_1(X_n - \mu_n) + \cdots + A_n(X_1 - \mu_1) \tag{7.5.2}$$

for some matrices $A_1, \ldots, A_n$, and that

$$Y - P_n(Y) \perp X_{n+1-i}, \quad i = 1, \ldots, n, \tag{7.5.3}$$

where we say that two $m$-dimensional random vectors $X$ and $Y$ are orthogonal (written $X \perp Y$) if $E(XY')$ is a matrix of zeros. The vector of best predictors (7.5.1) is uniquely determined by (7.5.2) and (7.5.3), although there may be more than one possible choice for $A_1, \ldots, A_n$.

As a special case of the above, if $\{X_t\}$ is a zero-mean time series, the best linear predictor $\hat X_{n+1}$ of $X_{n+1}$ in terms of $X_1, \ldots, X_n$ is obtained on replacing $Y$ by $X_{n+1}$ in (7.5.1). Thus

$$\hat X_{n+1} = \begin{cases} 0, & \text{if } n = 0, \\ P_n(X_{n+1}), & \text{if } n \ge 1. \end{cases}$$

Hence, we can write

$$\hat X_{n+1} = \Phi_{n1} X_n + \cdots + \Phi_{nn} X_1, \quad n = 1, 2, \ldots, \tag{7.5.4}$$

where, from (7.5.3), the coefficients $\Phi_{nj}$, $j = 1, \ldots, n$, are such that

$$E\big(\hat X_{n+1} X_{n+1-i}'\big) = E\big(X_{n+1} X_{n+1-i}'\big), \quad i = 1, \ldots, n, \tag{7.5.5}$$

i.e.,

$$\sum_{j=1}^{n} \Phi_{nj} K(n + 1 - j, n + 1 - i) = K(n + 1, n + 1 - i), \quad i = 1, \ldots, n.$$

In the case where $\{X_t\}$ is stationary with $K(i, j) = \Gamma(i - j)$, the prediction equations simplify to the $m$-dimensional analogues of (2.5.7), i.e.,

$$\sum_{j=1}^{n} \Phi_{nj} \Gamma(i - j) = \Gamma(i), \quad i = 1, \ldots, n. \tag{7.5.6}$$

Provided that the covariance matrix of the $nm$ components of $X_1, \ldots, X_n$ is nonsingular for every $n \ge 1$, the coefficients $\{\Phi_{nj}\}$ can be determined recursively using a multivariate version of the Durbin-Levinson algorithm given by Whittle (1963) (for details see TSTM, Proposition 11.4.1). Whittle’s recursions also determine the covariance matrices of the one-step prediction errors, namely, $V_0 = \Gamma(0)$ and, for $n \ge 1$,

$$\begin{aligned}
V_n &= E\big(X_{n+1} - \hat X_{n+1}\big)\big(X_{n+1} - \hat X_{n+1}\big)' \\
&= \Gamma(0) - \Phi_{n1}\Gamma(-1) - \cdots - \Phi_{nn}\Gamma(-n).
\end{aligned} \tag{7.5.7}$$

Remark 5. The innovations algorithm also has a multivariate version that can be used for prediction in much the same way as the univariate version described in Section 2.5.2 (for details see TSTM, Proposition 11.4.2).


7.6 Modeling and Forecasting with Multivariate AR Processes 


If $\{X_t\}$ is any zero-mean second-order multivariate time series, it is easy to show from the results of Section 7.5 (Problem 7.4) that the one-step prediction errors $X_j - \hat X_j$, $j = 1, \ldots, n$, have the property

$$E\big(X_j - \hat X_j\big)\big(X_k - \hat X_k\big)' = 0 \quad \text{for } j \ne k. \tag{7.6.1}$$

Moreover, the matrix $M$ such that

$$\begin{bmatrix} X_1 - \hat X_1 \\ X_2 - \hat X_2 \\ X_3 - \hat X_3 \\ \vdots \\ X_n - \hat X_n \end{bmatrix} = M \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_n \end{bmatrix} \tag{7.6.2}$$

is lower triangular with ones on the diagonal and therefore has determinant equal to 1.

If the series $\{X_t\}$ is also Gaussian, then (7.6.1) implies that the prediction errors $U_j = X_j - \hat X_j$, $j = 1, \ldots, n$, are independent with covariance matrices $V_0, \ldots, V_{n-1}$, respectively (as specified in (7.5.7)). Consequently, the joint density of the prediction errors is the product

$$f(u_1, \ldots, u_n) = (2\pi)^{-nm/2} \Big( \prod_{j=1}^{n} \det V_{j-1} \Big)^{-1/2} \exp\Big[ -\frac{1}{2} \sum_{j=1}^{n} u_j' V_{j-1}^{-1} u_j \Big].$$

Since the determinant of the matrix $M$ in (7.6.2) is equal to 1, the joint density of the observations $X_1, \ldots, X_n$ at $x_1, \ldots, x_n$ is obtained on replacing $u_1, \ldots, u_n$ in the last expression by the values of $X_j - \hat X_j$ corresponding to the observations $x_1, \ldots, x_n$.

If we suppose that $\{X_t\}$ is a zero-mean $m$-variate AR($p$) process with coefficient matrices $\Phi = \{\Phi_1, \ldots, \Phi_p\}$ and white noise covariance matrix $\Sigma$, we can therefore express the likelihood of the observations $X_1, \ldots, X_n$ as

$$L(\Phi, \Sigma) = (2\pi)^{-nm/2} \Big( \prod_{j=1}^{n} \det V_{j-1} \Big)^{-1/2} \exp\Big[ -\frac{1}{2} \sum_{j=1}^{n} U_j' V_{j-1}^{-1} U_j \Big],$$

where $U_j = X_j - \hat X_j$, $j = 1, \ldots, n$, and $\hat X_j$ and $V_j$ are found from (7.5.4), (7.5.6), and (7.5.7).

Maximization of the Gaussian likelihood is much more difficult in the multivariate than in the univariate case because of the potentially large number of parameters involved and the fact that it is not possible to compute the maximum likelihood estimator of $\Phi$ independently of $\Sigma$ as in the univariate case. In principle, maximum likelihood estimators can be computed with the aid of efficient nonlinear optimization algorithms, but it is important to begin the search with preliminary estimates that are reasonably close to the maximum. For pure AR processes good preliminary estimates can be obtained using Whittle’s algorithm or a multivariate version of Burg’s algorithm given by Jones (1978). We shall restrict our discussion here to the use of Whittle’s algorithm (the multivariate option AR-Model>Estimation>Yule-Walker in ITSM), but Jones’s multivariate version of Burg’s algorithm is also available (AR-Model>Estimation>Burg). Other useful algorithms can be found in Lütkepohl (1993), in particular the method of conditional least squares and the method of Hannan and Rissanen (1982), the latter being useful also for preliminary estimation in the more difficult problem of fitting ARMA($p$, $q$) models with $q > 0$. Spectral methods of estimation for multivariate ARMA processes are also frequently used. A discussion of these (as well as some time-domain methods) is given in Anderson (1980).

Order selection for multivariate autoregressive models can be made by minimizing a multivariate analogue of the univariate AICC statistic

$$\mathrm{AICC} = -2 \ln L(\Phi_1, \ldots, \Phi_p, \Sigma) + \frac{2(pm^2 + 1)nm}{nm - pm^2 - 2}. \tag{7.6.3}$$


7.6.1 Estimation for Autoregressive Processes Using Whittle’s Algorithm 
If $\{X_t\}$ is the (causal) multivariate AR($p$) process defined by the difference equations

$$X_t = \Phi_1 X_{t-1} + \cdots + \Phi_p X_{t-p} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \Sigma), \tag{7.6.4}$$

then postmultiplying by $X_{t-j}'$, $j = 0, \ldots, p$, and taking expectations gives the equations

$$\Sigma = \Gamma(0) - \sum_{j=1}^{p} \Phi_j \Gamma(-j) \tag{7.6.5}$$

and

$$\Gamma(i) = \sum_{j=1}^{p} \Phi_j \Gamma(i - j), \quad i = 1, \ldots, p. \tag{7.6.6}$$

Given the matrices $\Gamma(0), \ldots, \Gamma(p)$, equations (7.6.6) can be used to determine the coefficient matrices $\Phi_1, \ldots, \Phi_p$. The white noise covariance matrix $\Sigma$ can then be found from (7.6.5). The solution of these equations for $\Phi_1, \ldots, \Phi_p$, and $\Sigma$ is identical to the solution of (7.5.6) and (7.5.7) for the prediction coefficient matrices $\Phi_{p1}, \ldots, \Phi_{pp}$ and the corresponding prediction error covariance matrix $V_p$. Consequently, Whittle’s algorithm can be used to carry out the algebra.
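For $p = 1$ the equations (7.6.5) and (7.6.6) can be solved in closed form: $\Phi_1 = \Gamma(1)\Gamma(0)^{-1}$ and $\Sigma = \Gamma(0) - \Phi_1\Gamma(1)'$, using $\Gamma(-1) = \Gamma'(1)$. The sketch below does this for the bivariate case by direct matrix inversion rather than by Whittle's recursions, which is a simplification for illustration. The matrices `gamma0` and `phi_true` are hypothetical inputs, and all function names are invented.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def yule_walker_var1(gamma0, gamma1):
    """Solve (7.6.6) for Phi_1 and then (7.6.5) for Sigma when p = 1 and m = 2."""
    phi = mat_mul(gamma1, inverse_2x2(gamma0))             # Gamma(1) = Phi_1 Gamma(0)
    sigma = mat_sub(gamma0, mat_mul(phi, transpose(gamma1)))  # Gamma(-1) = Gamma(1)'
    return phi, sigma

# Hypothetical autocovariances built from a known coefficient matrix.
gamma0 = [[2.0, 0.3], [0.3, 1.0]]
phi_true = [[0.5, 0.1], [-0.2, 0.4]]
gamma1 = mat_mul(phi_true, gamma0)

phi_hat, sigma_hat = yule_walker_var1(gamma0, gamma1)
print(phi_hat)
print(sigma_hat)
```

Replacing `gamma0` and `gamma1` by sample autocovariance matrices gives the Yule-Walker estimates for a bivariate AR(1) fit.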
The Yule-Walker estimators $\hat\Phi_1, \ldots, \hat\Phi_p$, and $\hat\Sigma$ for the model (7.6.4) fitted to the data $X_1, \ldots, X_n$ are obtained by replacing $\Gamma(j)$ in (7.6.5) and (7.6.6) by $\hat\Gamma(j)$, $j = 0, \ldots, p$, and solving the resulting equations for $\Phi_1, \ldots, \Phi_p$, and $\Sigma$. The solution of these equations is obtained from ITSM by selecting the multivariate option AR-Model>Estimation>Yule-Walker. The mean vector of the fitted model is the sample mean of the data, and Whittle’s algorithm is used to solve the equations (7.6.5) and (7.6.6) for the coefficient matrices and the white noise covariance matrix. The fitted model is displayed by ITSM in the form

$$X_t = \phi_0 + \Phi_1 X_{t-1} + \cdots + \Phi_p X_{t-p} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \Sigma).$$

Note that the mean $\mu$ of this model is not the vector $\phi_0$, but

$$\mu = (I - \Phi_1 - \cdots - \Phi_p)^{-1}\phi_0.$$

In fitting multivariate autoregressive models using ITSM, check the box Find minimum AICC model to find the AR($p$) model with $0 \le p \le 20$ that minimizes the AICC value as defined in (7.6.3).

Analogous calculations using Jones’s multivariate version of Burg’s algorithm can be carried out by selecting AR-Model>Estimation>Burg.


Example 7.6.1  The Dow Jones and All Ordinaries Indices

To find the minimum AICC Yule-Walker model (of order less than or equal to 20) for the bivariate series $\{(X_{t1}, X_{t2})', t = 1, \ldots, 250\}$ of Example 7.1.1, proceed as follows. Select File>Project>Open>Multivariate, click OK, and then double-click on the file name, DJAOPC2.TSM. Check that Number of columns is set to 2, the dimension of the observation vectors, and click OK again to see graphs of the two component time series. No differencing is required (recalling from Example 7.1.1 that $\{X_{t1}\}$ and $\{X_{t2}\}$ are the daily percentage price changes of the original Dow Jones and All Ordinaries Indices). Select AR-Model>Estimation>Yule-Walker, check the box Find minimum AICC Model, click OK, and you will obtain the model

$$\begin{bmatrix} X_{t1} \\ X_{t2} \end{bmatrix} = \begin{bmatrix} .0288 \\ .00836 \end{bmatrix} + \begin{bmatrix} -.0148 & .0357 \\ .6589 & .0998 \end{bmatrix} \begin{bmatrix} X_{t-1,1} \\ X_{t-1,2} \end{bmatrix} + \begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix},$$

where

$$\begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix} \sim \mathrm{WN}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} .3653 & .0224 \\ .0224 & .6016 \end{bmatrix} \right).$$


Example 7.6.2  Sales with a leading indicator


The series $\{Y_{t1}\}$ (leading indicator) and $\{Y_{t2}\}$ (sales) are stored in bivariate form ($Y_{t1}$ in column 1 and $Y_{t2}$ in column 2) in the file LS2.TSM. On opening this file in ITSM you will see the graphs of the two component time series. Inspection of the graphs immediately suggests, as in Example 7.2.2, that the differencing operator $\nabla = 1 - B$ should be applied to the data before a stationary AR model is fitted. Select Transform>Difference and specify 1 for the differencing lag. Click OK and you will see the graphs of the two differenced series. Inspection of the series and their correlation functions (obtained by pressing the second yellow button at the top of the ITSM window) suggests that no further differencing is necessary. The next step is to select AR-Model>Estimation>Yule-Walker with the option Find minimum AICC model. The resulting model has order $p = 5$ and parameters $\hat\phi_0 = (.0328, .0156)'$,


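$$\hat\Phi_1 = \begin{bmatrix} \ldots & .024 \\ -.019 & -.051 \end{bmatrix}, \quad \hat\Phi_2 = \begin{bmatrix} \ldots & \ldots \\ .047 & .250 \end{bmatrix}, \quad \hat\Phi_3 = \begin{bmatrix} \ldots & \ldots \\ 4.678 & .207 \end{bmatrix},$$

$$\hat\Phi_4 = \begin{bmatrix} -.032 & -.009 \\ 3.664 & .004 \end{bmatrix}, \quad \hat\Phi_5 = \begin{bmatrix} .022 & .011 \\ 1.300 & .029 \end{bmatrix}, \quad \hat\Sigma = \begin{bmatrix} .076 & -.003 \\ -.003 & .095 \end{bmatrix},$$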


with AICC = 109.49. (Analogous calculations using Burg’s algorithm give an AR(8) model for the differenced series.) The sample cross-correlations of the residual vectors $\hat Z_t$ can be plotted by clicking on the last blue button at the top of the ITSM window. These are nearly all within the bounds $\pm 1.96/\sqrt{n}$, suggesting that the model is a good fit. The components of the residual vectors themselves are plotted by selecting AR Model>Residual Analysis>Plot Residuals. Simulated observations from the fitted model can be generated using the option AR Model>Simulate. The fitted model has the interesting property that the upper right component of each of the coefficient matrices is close to zero. This suggests that $\{X_{t1}\}$ can be effectively modeled independently of $\{X_{t2}\}$. In fact, the MA(1) model

$$X_{t1} = (1 - .474B)U_t, \quad \{U_t\} \sim \mathrm{WN}(0, .0779), \tag{7.6.7}$$

provides an adequate fit to the univariate series $\{X_{t1}\}$. Inspecting the bottom rows of the coefficient matrices and deleting small entries, we find that the relation between $\{X_{t1}\}$ and $\{X_{t2}\}$ can be expressed approximately as

$$X_{t2} = .250X_{t-2,2} + .207X_{t-3,2} + 4.678X_{t-3,1} + 3.664X_{t-4,1} + 1.300X_{t-5,1} + W_t,$$

or equivalently,

$$X_{t2} = \frac{4.678B^3(1 + .783B + .278B^2)}{1 - .250B^2 - .207B^3}\,X_{t1} + \frac{W_t}{1 - .250B^2 - .207B^3}, \tag{7.6.8}$$

where $\{W_t\} \sim \mathrm{WN}(0, .095)$. Moreover, since the estimated noise covariance matrix is essentially diagonal, it follows that the two sequences $\{X_{t1}\}$ and $\{W_t\}$ are uncorrelated. This reduced model defined by (7.6.7) and (7.6.8) is an example of a transfer function model that expresses the “output” series $\{X_{t2}\}$ as the output of a linear filter with “input” $\{X_{t1}\}$ plus added noise. A more direct approach to the fitting of transfer function models is given in Section 10.1 and applied to this same data set.


7.6.2 Forecasting Multivariate Autoregressive Processes


The technique developed in Section 7.5 allows us to compute the minimum mean squared error one-step linear predictors $\hat X_{n+1}$ for any multivariate stationary time series from the mean $\mu$ and autocovariance matrices $\Gamma(h)$ by recursively determining the coefficients $\Phi_{ni}$, $i = 1, \ldots, n$, and evaluating

$$\hat X_{n+1} = \mu + \Phi_{n1}(X_n - \mu) + \cdots + \Phi_{nn}(X_1 - \mu). \tag{7.6.9}$$

The situation is simplified when $\{X_t\}$ is the causal AR($p$) process defined by (7.6.4), since for $n \ge p$ (as is almost always the case in practice)

$$\hat X_{n+1} = \Phi_1 X_n + \cdots + \Phi_p X_{n+1-p}. \tag{7.6.10}$$

To verify (7.6.10) it suffices to observe that the right-hand side has the required form (7.5.2) and that the prediction error

$$X_{n+1} - \Phi_1 X_n - \cdots - \Phi_p X_{n+1-p} = Z_{n+1}$$

is orthogonal to $X_1, \ldots, X_n$ in the sense of (7.5.3). (In fact, the prediction error is orthogonal to all $X_j$, $-\infty < j \le n$, showing that if $n \ge p$, then (7.6.10) is also the best linear predictor of $X_{n+1}$ in terms of all components of $X_j$, $-\infty < j \le n$.) The covariance matrix of the one-step prediction error is clearly $E(Z_{n+1}Z_{n+1}') = \Sigma$.

To compute the best $h$-step linear predictor $P_n X_{n+h}$ based on all the components of $X_1, \ldots, X_n$ we apply the linear operator $P_n$ to (7.6.4) to obtain the recursions

$$P_n X_{n+h} = \Phi_1 P_n X_{n+h-1} + \cdots + \Phi_p P_n X_{n+h-p}. \tag{7.6.11}$$

These equations are easily solved recursively, first for $P_n X_{n+1}$, then for $P_n X_{n+2}$, $P_n X_{n+3}, \ldots$, etc. If $n \ge p$, then the $h$-step predictors based on all components of $X_j$, $-\infty < j \le n$, also satisfy (7.6.11) and are therefore the same as the $h$-step predictors based on $X_1, \ldots, X_n$.
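The recursions (7.6.11) amount to feeding each predictor back in as if it were an observation. The following sketch implements them with an intercept $\phi_0$ added to each step, as needed for a model with nonzero mean. The coefficient values are those of the VAR(1) model in Example 7.6.1; the last observed vector is hypothetical, and the function names are invented for illustration.

```python
def mat_vec(a, v):
    """Matrix-vector product with the matrix stored as nested lists."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def var_forecast(phis, phi0, history, steps):
    """h-step predictors for h = 1..steps from (7.6.11), with intercept phi0 added."""
    p = len(phis)
    path = [list(x) for x in history[-p:]]     # most recent p observations, oldest first
    predictions = []
    for _ in range(steps):
        nxt = list(phi0)
        for k in range(1, p + 1):
            # Phi_k times the value k steps back (an observation or an earlier predictor)
            term = mat_vec(phis[k - 1], path[-k])
            nxt = [a + b for a, b in zip(nxt, term)]
        predictions.append(nxt)
        path.append(nxt)
    return predictions

phi0 = [0.0288, 0.00836]                              # intercept from Example 7.6.1
phi1 = [[-0.0148, 0.0357], [0.6589, 0.0998]]          # coefficient matrix from Example 7.6.1
last_observation = [0.5, -0.2]                        # hypothetical X_n
preds = var_forecast([phi1], phi0, [last_observation], steps=50)
print(preds[0], preds[1], preds[-1])
```

As the horizon grows, the predictors settle at the model mean $(I - \Phi_1)^{-1}\phi_0$, whose second component is close to the sample mean .0309 quoted in Example 7.6.3.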

To compute the $h$-step error covariance matrices, recall from (7.4.6) that

$$X_{n+h} = \sum_{j=0}^{\infty} \Psi_j Z_{n+h-j}, \tag{7.6.12}$$

where the coefficient matrices $\Psi_j$ are found from the recursions (7.4.8) with $q = 0$. From (7.6.12) we find that for $n \ge p$,

$$P_n X_{n+h} = \sum_{j=h}^{\infty} \Psi_j Z_{n+h-j}. \tag{7.6.13}$$

Subtracting (7.6.13) from (7.6.12) gives the $h$-step prediction error

$$X_{n+h} - P_n X_{n+h} = \sum_{j=0}^{h-1} \Psi_j Z_{n+h-j}, \tag{7.6.14}$$



with covariance matrix

$$E\big[(X_{n+h} - P_n X_{n+h})(X_{n+h} - P_n X_{n+h})'\big] = \sum_{j=0}^{h-1} \Psi_j\,\Sigma\,\Psi_j', \quad n \ge p. \tag{7.6.15}$$

For the (not necessarily zero-mean) causal AR($p$) process defined by

$$X_t = \phi_0 + \Phi_1 X_{t-1} + \cdots + \Phi_p X_{t-p} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \Sigma),$$

equations (7.6.10) and (7.6.11) remain valid, provided that $\phi_0$ is added to each of their right-hand sides. The error covariance matrices are the same as in the case $\phi_0 = 0$.

The above calculations are all based on the assumption that the AR($p$) model for the series is known. However, in practice, the parameters of the model are usually estimated from the data, and the uncertainty in the predicted values of the series will be larger than indicated by (7.6.15) because of parameter estimation errors. See Lütkepohl (1993).


Example 7.6.3  The Dow Jones and All Ordinaries Indices

The VAR(1) model fitted to the series $\{X_t, t = 1, \ldots, 250\}$ in Example 7.6.1 was

$$\begin{bmatrix} X_{t1} \\ X_{t2} \end{bmatrix} = \begin{bmatrix} .0288 \\ .00836 \end{bmatrix} + \begin{bmatrix} -.0148 & .0357 \\ .6589 & .0998 \end{bmatrix} \begin{bmatrix} X_{t-1,1} \\ X_{t-1,2} \end{bmatrix} + \begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix},$$

where

$$\begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix} \sim \mathrm{WN}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} .3653 & .0224 \\ .0224 & .6016 \end{bmatrix} \right).$$

The one-step mean squared error for prediction of $X_{t2}$, assuming the validity of this model, is thus 0.6016. This is a substantial reduction from the estimated mean squared error $\hat\gamma_{22}(0) = .7712$ when the sample mean $\hat\mu_2 = .0309$ is used as the one-step predictor.

If we fit a univariate model to the series $\{X_{t2}\}$ using ITSM, we find that the autoregression with minimum AICC value (645.0) is

$$X_{t2} = .0273 + .1180X_{t-1,2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, .7604).$$

Assuming the validity of this model, we thus obtain a mean squared error for one-step prediction of .7604, which is slightly less than the estimated mean squared error (.7712) incurred when the sample mean is used for one-step prediction.

The preceding calculations suggest that there is little to be gained from the point of view of one-step prediction by fitting a univariate model to $\{X_{t2}\}$, while there is a substantial reduction achieved by the bivariate AR(1) model for $\{X_t = (X_{t1}, X_{t2})'\}$.

To test the models fitted above, we consider the next forty values $\{X_t, t = 251, \ldots, 290\}$, which are stored in the file DJAOPCF.TSM. We can use these values, in conjunction with the bivariate and univariate models fitted to the data for $t = 1, \ldots, 250$, to compute one-step predictors of $X_{t2}$, $t = 251, \ldots, 290$. The results are as follows:

Predictor                  Average Squared Error
$\hat\mu = 0.0309$         .4706
AR(1)                      .4591
VAR(1)                     .3962

It is clear from these results that the sample variance of the series $\{X_{t2}, t = 251, \ldots, 290\}$ is rather less than that of the series $\{X_{t2}, t = 1, \ldots, 250\}$, and consequently, the average squared errors of all three predictors are substantially less than expected from the models fitted to the latter series. Both the AR(1) and VAR(1) models show an improvement in one-step average squared error over the sample mean $\hat\mu$, but the improvement shown by the bivariate model is much more pronounced.
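Formula (7.6.15) can be evaluated directly for the VAR(1) model of this example, for which $\Psi_j = \Phi^j$. The sketch below (hypothetical function names, plain nested lists) accumulates the sum for increasing $h$ and prints the (2,2) entry, the mean squared error for predicting $X_{t2}$.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def var1_error_covariances(phi, sigma, max_h):
    """Error covariance matrices (7.6.15) for h = 1..max_h, using Psi_j = Phi^j."""
    m = len(phi)
    psi = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]   # Psi_0 = I
    total = [[0.0] * m for _ in range(m)]
    result = []
    for _ in range(max_h):
        total = mat_add(total, mat_mul(mat_mul(psi, sigma), transpose(psi)))
        result.append(total)
        psi = mat_mul(phi, psi)                                             # next Psi_j
    return result

phi = [[-0.0148, 0.0357], [0.6589, 0.0998]]     # from Example 7.6.1
sigma = [[0.3653, 0.0224], [0.0224, 0.6016]]    # from Example 7.6.1
covs = var1_error_covariances(phi, sigma, 10)
for h in (1, 2, 10):
    print(h, round(covs[h - 1][1][1], 4))
```

The one-step value is the .6016 quoted above; the two-step value is already about .769, and by $h = 10$ the entry has levelled off near .771, close to $\hat\gamma_{22}(0) = .7712$. In this model almost all of the predictive gain for $X_{t2}$ is therefore at horizon one.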


The calculation of predictors and their error covariance matrices for multivariate ARIMA and SARIMA processes is analogous to the corresponding univariate calculation, so we shall simply state the pertinent results. Suppose that $\{Y_t\}$ is a nonstationary process satisfying $D(B)Y_t = U_t$ where $D(z) = 1 - d_1 z - \cdots - d_r z^r$ is a polynomial with $D(1) = 0$ and $\{U_t\}$ is a causal invertible ARMA process with mean $\mu$. Then $X_t = U_t - \mu$ satisfies

$$\Phi(B)X_t = \Theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \Sigma). \tag{7.6.16}$$

Under the assumption that the random vectors $Y_{-r+1}, \ldots, Y_0$ are uncorrelated with the sequence $\{Z_t\}$, the best linear predictors $\tilde P_n Y_j$ of $Y_j$, $j > n > 0$, based on 1 and the components of $Y_j$, $-r + 1 \le j \le n$, are found as follows. Compute the observed values of $U_t = D(B)Y_t$, $t = 1, \ldots, n$, and use the ARMA model for $X_t = U_t - \mu$ to compute predictors $P_n U_{n+h}$. Then use the recursions

$$\tilde P_n Y_{n+h} = P_n U_{n+h} + \sum_{j=1}^{r} d_j \tilde P_n Y_{n+h-j} \tag{7.6.17}$$

to compute successively $\tilde P_n Y_{n+1}$, $\tilde P_n Y_{n+2}$, $\tilde P_n Y_{n+3}$, etc. The error covariance matrices are approximately (for large $n$)

$$E\big[(Y_{n+h} - \tilde P_n Y_{n+h})(Y_{n+h} - \tilde P_n Y_{n+h})'\big] = \sum_{j=0}^{h-1} \Psi_j^*\,\Sigma\,\Psi_j^{*\prime}, \tag{7.6.18}$$

where $\Psi_j^*$ is the coefficient of $z^j$ in the power series expansion

$$\sum_{j=0}^{\infty} \Psi_j^* z^j = D(z)^{-1}\Phi^{-1}(z)\Theta(z), \quad |z| < 1.$$

The matrices $\Psi_j^*$ are most readily found from the recursions (7.4.8) after replacing $\Phi_j$, $j = 1, \ldots, p$, by $\Phi_j^*$, $j = 1, \ldots, p + r$, where $\Phi_j^*$ is the coefficient of $z^j$ in $D(z)\Phi(z)$.
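For the common case of a single difference, $D(z) = 1 - z$ (so $r = 1$ and $d_1 = 1$), applied to a pure autoregression ($\Theta(z) = I$), the first few $\Psi_j^*$ can be written out by hand. The following worked sketch assumes that $D(z)\Phi(z)$ is written as $I - \Phi_1^* z - \cdots - \Phi_{p+1}^* z^{p+1}$, matching the sign convention of (7.4.1):

```latex
\begin{align*}
D(z)\Phi(z) &= (1 - z)\,(I - \Phi_1 z - \Phi_2 z^2 - \cdots - \Phi_p z^p) \\
            &= I - (I + \Phi_1)\,z - (\Phi_2 - \Phi_1)\,z^2 - \cdots
               - (\Phi_p - \Phi_{p-1})\,z^p + \Phi_p\,z^{p+1}, \\[4pt]
\Phi_1^* &= I + \Phi_1, \qquad
\Phi_j^* = \Phi_j - \Phi_{j-1} \ (2 \le j \le p), \qquad
\Phi_{p+1}^* = -\Phi_p, \\[4pt]
\Psi_0^* &= I, \qquad \Psi_1^* = \Phi_1^* \Psi_0^* = I + \Phi_1, \\[4pt]
\text{(7.6.18) with } h = 1 &: \quad \Sigma, \\
\text{(7.6.18) with } h = 2 &: \quad \Sigma + (I + \Phi_1)\,\Sigma\,(I + \Phi_1)'.
\end{align*}
```

Under these assumptions the one-step error covariance of the undifferenced series equals that of the differenced series, while the two-step matrix picks up the extra factor $I + \Phi_1$ in place of $\Phi_1$.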
Remark 6. In the special case where $\Theta(z) = I$ (i.e., in the purely autoregressive case) the expression (7.6.18) for the $h$-step error covariance matrix is exact for all $n \ge p$ (i.e., if there are at least $p + r$ observed vectors). The program ITSM allows differencing transformations and subtraction of the mean before fitting a multivariate autoregression. Predicted values for the original series and the standard deviations of the prediction errors can be determined using the multivariate option Forecasting>AR Model.


Remark 7. In the multivariate case, simple differencing of the type discussed in this section where the same operator $D(B)$ is applied to all components of the random vectors is rather restrictive. It is useful to consider more general linear transformations of the data for the purpose of generating a stationary series. Such considerations lead to the class of cointegrated models discussed briefly in Section 7.7 below.


Example 7.6.4  Sales with a leading indicator

Assume that the model fitted to the bivariate series $\{Y_t, t = 0, \ldots, 149\}$ in Example 7.6.2 is correct, i.e., that

$$\hat\Phi(B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat\Sigma),$$

where

$$X_t = (1 - B)Y_t - (.0228, .420)', \quad t = 1, \ldots, 149,$$

$\hat\Phi(B) = I - \hat\Phi_1 B - \cdots - \hat\Phi_5 B^5$, and $\hat\Phi_1, \ldots, \hat\Phi_5$, $\hat\Sigma$ are the matrices found in Example 7.6.2. Then the one- and two-step predictors of $X_{150}$ and $X_{151}$ are obtained from (7.6.11) as

$$P_{149}X_{150} = \hat\Phi_1 X_{149} + \cdots + \hat\Phi_5 X_{145} = \begin{bmatrix} .163 \\ \ldots \end{bmatrix}$$

and

$$P_{149}X_{151} = \hat\Phi_1 P_{149}X_{150} + \hat\Phi_2 X_{149} + \cdots + \hat\Phi_5 X_{146} = \begin{bmatrix} -.027 \\ .816 \end{bmatrix}$$

with error covariance matrices, from (7.6.15),

$$\hat\Sigma = \begin{bmatrix} .076 & -.003 \\ -.003 & .095 \end{bmatrix}$$

and

$$\hat\Sigma + \hat\Phi_1 \hat\Sigma \hat\Phi_1' = \begin{bmatrix} .096 & -.002 \\ -.002 & .095 \end{bmatrix},$$

respectively.
Similarly, the one- and two-step predictors of $Y_{150}$ and $Y_{151}$ are obtained from (7.6.17) as

$$\tilde P_{149}Y_{150} = \begin{bmatrix} .0228 \\ .420 \end{bmatrix} + P_{149}X_{150} + Y_{149} = \begin{bmatrix} 13.59 \\ \ldots \end{bmatrix}$$

and

$$\tilde P_{149}Y_{151} = \begin{bmatrix} .0228 \\ .420 \end{bmatrix} + P_{149}X_{151} + \tilde P_{149}Y_{150} = \begin{bmatrix} 13.59 \\ \ldots \end{bmatrix}$$

with error covariance matrices, from (7.6.18),

$$\hat\Sigma = \begin{bmatrix} .076 & -.003 \\ -.003 & .095 \end{bmatrix}$$

and

$$\hat\Sigma + (I + \hat\Phi_1)\hat\Sigma(I + \hat\Phi_1)' = \begin{bmatrix} \ldots & \ldots \\ \ldots & \ldots \end{bmatrix},$$

respectively. The predicted values and the standard deviations of the predictors can easily be verified with the aid of the program ITSM. It is also of interest to compare the results with those obtained by fitting a transfer function model to the data as described in Section 10.1 below.


7.7 Cointegration


We have seen that nonstationary univariate time series can frequently be made stationary by applying the differencing operator $\nabla = 1 - B$ repeatedly. If $\{\nabla^d X_t\}$ is stationary for some positive integer $d$ but $\{\nabla^{d-1} X_t\}$ is nonstationary, we say that $\{X_t\}$ is integrated of order $d$, or more concisely, $\{X_t\} \sim I(d)$. Many macroeconomic time series are found to be integrated of order 1.

If $\{X_t\}$ is a $k$-variate time series, we define $\{\nabla^d X_t\}$ to be the series whose $j$th component is obtained by applying the operator $(1 - B)^d$ to the $j$th component of $\{X_t\}$, $j = 1, \ldots, k$. The idea of a cointegrated multivariate time series was introduced by Granger (1981) and developed by Engle and Granger (1987). Here we use the slightly different definition of Lütkepohl (1993). We say that the $k$-dimensional time series $\{X_t\}$ is integrated of order $d$ (or $\{X_t\} \sim I(d)$) if $d$ is a positive integer, $\{\nabla^d X_t\}$ is stationary, and $\{\nabla^{d-1} X_t\}$ is nonstationary. The $I(d)$ process $\{X_t\}$ is said to be cointegrated with cointegration vector $\alpha$ if $\alpha$ is a $k \times 1$ vector such that $\{\alpha' X_t\}$ is of order less than $d$.
Example 7.7.1

A simple example is provided by the bivariate process whose first component is the
random walk

$$X_t = \sum_{j=1}^{t} Z_j, \quad t = 1, 2, \ldots, \quad \{Z_t\} \sim \mathrm{IID}(0, \sigma^2),$$

and whose second component consists of noisy observations of the same random
walk,

$$Y_t = X_t + W_t, \quad t = 1, 2, \ldots, \quad \{W_t\} \sim \mathrm{IID}(0, \tau^2),$$

where $\{W_t\}$ is independent of $\{Z_t\}$. Then $\{(X_t, Y_t)'\}$ is integrated of order 1 and
cointegrated with cointegration vector $\boldsymbol{\alpha} = (1, -1)'$.

The notion of cointegration captures the idea of univariate nonstationary time
series "moving together." Thus, even though $\{X_t\}$ and $\{Y_t\}$ in Example 7.7.1 are both
nonstationary, they are linked in the sense that they differ only by the stationary
sequence $\{W_t\}$. Series that behave in a cointegrated manner are often encountered
in economics. Engle and Granger (1991) give as an illustrative example the prices
of tomatoes $U_t$ and $V_t$ in Northern and Southern California. These are linked by the
fact that if one were to increase sufficiently relative to the other, the profitability of
buying in one market and selling for a profit in the other would tend to push the prices
$(U_t, V_t)'$ toward the straight line $v = u$ in $\mathbb{R}^2$. This line is said to be an attractor
for $(U_t, V_t)'$, since although $U_t$ and $V_t$ may both vary in a nonstationary manner as
$t$ increases, the points $(U_t, V_t)'$ will exhibit relatively small random deviations from
the line $v = u$.


Example 7.7.2

If we apply the operator $\nabla = 1 - B$ to the bivariate process defined in Example 7.7.1
in order to render it stationary, we obtain the series $(U_t, V_t)'$, where

$$U_t = Z_t$$

and

$$V_t = Z_t + W_t - W_{t-1}.$$

The series $\{(U_t, V_t)'\}$ is clearly a stationary multivariate MA(1) process

$$\begin{bmatrix} U_t \\ V_t \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} Z_t \\ Z_t + W_t \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} Z_{t-1} \\ Z_{t-1} + W_{t-1} \end{bmatrix}.$$

However, the process $\{(U_t, V_t)'\}$ cannot be represented as an AR($\infty$) process, since
the matrix $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - z \begin{bmatrix} 0 & 0 \\ -1 & 1 \end{bmatrix}$ has zero determinant when $z = 1$, thus violating condition
(7.4.10). Care is therefore needed in the estimation of parameters for such models
(and the closely related error-correction models). We shall not go into the details here
but refer the reader to Engle and Granger (1987) and Lütkepohl (1993).
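
The process of Example 7.7.1 is easy to simulate, which makes the role of the cointegration vector concrete. The following Python sketch is illustrative only: the function name, the Gaussian choice for the noise distributions, and the parameter values are assumptions made here and are not part of the text. It checks that the combination $\boldsymbol{\alpha}'(X_t, Y_t)' = X_t - Y_t$ with $\boldsymbol{\alpha} = (1, -1)'$ equals $-W_t$, so that it stays near zero even though both components wander.

```python
import random

def simulate_example_771(n, sigma=1.0, tau=0.5, seed=0):
    """Simulate X_t = Z_1 + ... + Z_t and Y_t = X_t + W_t for t = 1, ..., n.

    Assumption for illustration: Z_t and W_t are Gaussian with standard
    deviations sigma and tau; the example itself only requires IID noise.
    """
    rng = random.Random(seed)
    level = 0.0
    xs, ys, ws = [], [], []
    for _ in range(n):
        level += rng.gauss(0.0, sigma)   # random walk increment Z_t
        w = rng.gauss(0.0, tau)          # observation noise W_t
        xs.append(level)
        ws.append(w)
        ys.append(level + w)
    return xs, ys, ws

xs, ys, ws = simulate_example_771(2000)

# alpha'(X_t, Y_t)' with alpha = (1, -1)' is X_t - Y_t = -W_t.
spread = [x - y for x, y in zip(xs, ys)]
max_gap = max(abs(s + w) for s, w in zip(spread, ws))

print("largest |(X_t - Y_t) + W_t|:", max_gap)    # zero up to rounding error
print("range of X_t:", min(xs), max(xs))
print("range of X_t - Y_t:", min(spread), max(spread))
```

Comparing the two printed ranges shows the nonstationary level against the bounded fluctuations of the cointegrating combination.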
Problems


7.1. Let $\{Y_t\}$ be a stationary process and define the bivariate process $X_{t1} = Y_t$,
$X_{t2} = Y_{t-d}$, where $d \neq 0$. Show that $\{(X_{t1}, X_{t2})'\}$ is stationary and express its
cross-correlation function in terms of the autocorrelation function of $\{Y_t\}$. If
$\rho_Y(h) \to 0$ as $h \to \infty$, show that there exists a lag $k$ for which $\rho_{12}(k) > \rho_{12}(0)$.


7.2. Show that the covariance matrix function of the multivariate linear process
defined by (7.2.12) is as specified in (7.2.13).


7.3. Let $\{\mathbf{X}_t\}$ be the bivariate time series whose components are the MA(1) processes
defined by

$$X_{t1} = Z_{t,1} + .8 Z_{t-1,1}, \quad \{Z_{t1}\} \sim \mathrm{IID}(0, \sigma_1^2),$$

and

$$X_{t2} = Z_{t,2} - .6 Z_{t-1,2}, \quad \{Z_{t2}\} \sim \mathrm{IID}(0, \sigma_2^2),$$

where the two sequences $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are independent.
a. Find a large-sample approximation to the variance of $n^{1/2}\hat{\rho}_{12}(h)$.
b. Find a large-sample approximation to the covariance of $n^{1/2}\hat{\rho}_{12}(h)$ and
$n^{1/2}\hat{\rho}_{12}(k)$ for $h \neq k$.


7.4. Use the characterization (7.5.3) of the multivariate best linear predictor of $\mathbf{Y}$ in
terms of $\{\mathbf{X}_1, \ldots, \mathbf{X}_n\}$ to establish the orthogonality of the one-step prediction
errors $\mathbf{X}_j - \hat{\mathbf{X}}_j$ and $\mathbf{X}_k - \hat{\mathbf{X}}_k$, $j \neq k$, as asserted in (7.6.1).


7.5. Determine the covariance matrix function of the ARMA(1,1) process satisfying

$$\mathbf{X}_t - \Phi\mathbf{X}_{t-1} = \mathbf{Z}_t + \Theta\mathbf{Z}_{t-1}, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, I_2),$$

where $I_2$ is the $2 \times 2$ identity matrix and $\Phi = \Theta' = \begin{bmatrix} 0.5 & 0.5 \\ 0 & 0.5 \end{bmatrix}$.


7.6. a. Let $\{\mathbf{X}_t\}$ be a causal AR($p$) process satisfying the recursions

$$\mathbf{X}_t = \Phi_1\mathbf{X}_{t-1} + \cdots + \Phi_p\mathbf{X}_{t-p} + \mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma).$$

For $n \ge p$ write down recursions for the predictors $P_n\mathbf{X}_{n+h}$, $h \ge 0$, and
find explicit expressions for the error covariance matrices in terms of the AR
coefficients and $\Sigma$ when $h = 1$, 2, and 3.

b. Suppose now that $\{\mathbf{Y}_t\}$ is the multivariate ARIMA($p$, 1, 0) process satisfying
$\nabla\mathbf{Y}_t = \mathbf{X}_t$, where $\{\mathbf{X}_t\}$ is the AR process in (a). Assuming that $E(\mathbf{Y}_0\mathbf{X}_t') = 0$
for $t \ge 1$, show (using (7.6.17) with $r = 1$ and $d = 1$) that

$$P_n(\mathbf{Y}_{n+h}) = \mathbf{Y}_n + \sum_{j=1}^{h} P_n\mathbf{X}_{n+j},$$

and derive the error covariance matrices when $h = 1$, 2, and 3. Compare
these results with those obtained in Example 7.6.4.


7.7. Use the program ITSM to find the minimum AICC AR model of order less
than or equal to 20 for the bivariate series $\{(X_{t1}, X_{t2})', t = 1, \ldots, 200\}$ with
components filed as APPJK2.TSM. Use the fitted model to predict $(X_{t1}, X_{t2})'$,
$t = 201, 202, 203$ and estimate the error covariance matrices of the predictors
(assuming that the fitted model is appropriate for the data).


7.8. Let $\{X_{t1}, t = 1, \ldots, 63\}$ and $\{X_{t2}, t = 1, \ldots, 63\}$ denote the differenced series
$\{\nabla \ln Y_{t1}\}$ and $\{\nabla \ln Y_{t2}\}$, where $\{Y_{t1}\}$ and $\{Y_{t2}\}$ are the annual mink and muskrat
trappings filed as APPH.TSM and APPI.TSM, respectively.

a. Use ITSM to construct and save the series $\{X_{t1}\}$ and $\{X_{t2}\}$ as univariate
data files X1.TSM and X2.TSM, respectively. (After making the required
transformations press the red EXP button and save each transformed series
to a file with the appropriate name.) To enter X1 and X2 as a bivariate series
in ITSM, open X1 as a multivariate series with Number of columns equal
to 1. Then open X2 as a univariate series. Click the project editor button
(at the top left of the ITSM window), click on the plus signs next to the
projects X1.TSM and X2.TSM, then click on the series that appears just
below X2.TSM and drag it to the first line of the project X1.TSM. It will
then be added as a second component, making X1.TSM a bivariate project
consisting of the two component series X1 and X2. Click OK to close the
project editor and close the ITSM window labeled X2.TSM. You will then
see the graphs of X1 and X2. Press the second yellow button to see the
correlation functions of $\{X_{t1}\}$ and $\{X_{t2}\}$. For more information on the project
editor in ITSM select Help>Contents>Project Editor.

b. Conduct a test for independence of the two series $\{X_{t1}\}$ and $\{X_{t2}\}$.


7.9. Use ITSM to open the data file STOCK7.TSM, which contains the daily returns
on seven different stock market indices from April 27th, 1998, through April
9th, 1999. (Click on Help>Contents>Data sets for more information.) Fit a
multivariate autoregression to the trivariate series consisting of the returns on
the Dow Jones Industrials, All Ordinaries, and Nikkei indices. Check the model
for goodness of fit and interpret the results.
8.1 State-Space Representations

8.2 The Basic Structural Model 

8.3 State-Space Representation of ARIMA Models 
8.4 The Kalman Recursions 

8.5 Estimation For State-Space Models 

8.6 State-Space Models with Missing Observations 
8.7 The EM Algorithm 

8.8 Generalized State-Space Models 


In recent years state-space representations and the associated Kalman recursions have 
had a profound impact on time series analysis and many related areas. The techniques 
were originally developed in connection with the control of linear systems (for ac- 
counts of this subject see Davis and Vinter, 1985, and Hannan and Deistler, 1988). 
An extremely rich class of models for time series, including and going well beyond 
the linear ARIMA and classical decomposition models considered so far in this book, 
can be formulated as special cases of the general state-space model defined below in 
Section 8.1. In econometrics the structural time series models developed by Harvey 
(1990) are formulated (like the classical decomposition model) directly in terms of 
components of interest such as trend, seasonal component, and noise. However, the 
rigidity of the classical decomposition model is avoided by allowing the trend and 
seasonal components to evolve randomly rather than deterministically. An introduc- 
tion to these structural models is given in Section 8.2, and a state-space representation 
is developed for a general ARIMA process in Section 8.3. The Kalman recursions, 
which play a key role in the analysis of state-space models, are derived in Section 
8.4. These recursions allow a unified approach to prediction and estimation for all 
processes that can be given a state-space representation. Following the development 
of the Kalman recursions we discuss estimation with structural models (Section 8.5) 
and the formulation of state-space models to deal with missing values (Section 8.6). In 
Section 8.7 we introduce the EM algorithm, an iterative procedure for maximizing the
likelihood when only a subset of the complete data set is available. The EM algorithm
is particularly well suited for estimation problems in the state-space framework.
Generalized state-space models are introduced in Section 8.8. These are Bayesian models
that can be used to represent time series of many different types, as demonstrated by
two applications to time series of count data. Throughout the chapter we shall use
the notation

$$\{\mathbf{W}_t\} \sim \mathrm{WN}(\mathbf{0}, \{R_t\})$$

to indicate that the random vectors $\mathbf{W}_t$ have mean $\mathbf{0}$ and that

$$E(\mathbf{W}_s\mathbf{W}_t') = \begin{cases} R_t, & \text{if } s = t, \\ 0, & \text{otherwise.} \end{cases}$$

8.1 State-Space Representations 


A state-space model for a (possibly multivariate) time series $\{\mathbf{Y}_t, t = 1, 2, \ldots\}$
consists of two equations. The first, known as the observation equation, expresses the
$w$-dimensional observation $\mathbf{Y}_t$ as a linear function of a $v$-dimensional state variable
$\mathbf{X}_t$ plus noise. Thus

$$\mathbf{Y}_t = G_t\mathbf{X}_t + \mathbf{W}_t, \quad t = 1, 2, \ldots, \tag{8.1.1}$$

where $\{\mathbf{W}_t\} \sim \mathrm{WN}(\mathbf{0}, \{R_t\})$ and $\{G_t\}$ is a sequence of $w \times v$ matrices. The second
equation, called the state equation, determines the state $\mathbf{X}_{t+1}$ at time $t + 1$ in terms
of the previous state $\mathbf{X}_t$ and a noise term. The state equation is

$$\mathbf{X}_{t+1} = F_t\mathbf{X}_t + \mathbf{V}_t, \quad t = 1, 2, \ldots, \tag{8.1.2}$$

where $\{F_t\}$ is a sequence of $v \times v$ matrices, $\{\mathbf{V}_t\} \sim \mathrm{WN}(\mathbf{0}, \{Q_t\})$, and $\{\mathbf{V}_t\}$ is
uncorrelated with $\{\mathbf{W}_t\}$ (i.e., $E(\mathbf{W}_t\mathbf{V}_s') = 0$ for all $s$ and $t$). To complete the specification,
it is assumed that the initial state $\mathbf{X}_1$ is uncorrelated with all of the noise terms $\{\mathbf{V}_t\}$
and $\{\mathbf{W}_t\}$.


Remark 1. A more general form of the state-space model allows for correlation
between $\mathbf{V}_t$ and $\mathbf{W}_t$ (see TSTM, Chapter 12) and for the addition of a control term
$H_t\mathbf{u}_t$ in the state equation. In control theory, $H_t\mathbf{u}_t$ represents the effect of applying
a "control" $\mathbf{u}_t$ at time $t$ for the purpose of influencing $\mathbf{X}_{t+1}$. However, the system
defined by (8.1.1) and (8.1.2) with $E(\mathbf{W}_t\mathbf{V}_s') = 0$ for all $s$ and $t$ will be adequate for
our purposes.


Remark 2. In many important special cases, the matrices $F_t$, $G_t$, $Q_t$, and $R_t$ will
be independent of $t$, in which case the subscripts will be suppressed.
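Remark 3. It follows from the observation equation (8.1.1) and the state equation
(8.1.2) that $\mathbf{X}_t$ and $\mathbf{Y}_t$ have the functional forms, for $t = 2, 3, \ldots$,

$$\begin{aligned}
\mathbf{X}_t &= F_{t-1}\mathbf{X}_{t-1} + \mathbf{V}_{t-1} \\
&= F_{t-1}(F_{t-2}\mathbf{X}_{t-2} + \mathbf{V}_{t-2}) + \mathbf{V}_{t-1} \\
&\;\;\vdots \\
&= (F_{t-1} \cdots F_1)\mathbf{X}_1 + (F_{t-1} \cdots F_2)\mathbf{V}_1 + \cdots + F_{t-1}\mathbf{V}_{t-2} + \mathbf{V}_{t-1} \\
&= f_t(\mathbf{X}_1, \mathbf{V}_1, \ldots, \mathbf{V}_{t-1})
\end{aligned} \tag{8.1.3}$$

and

$$\mathbf{Y}_t = g_t(\mathbf{X}_1, \mathbf{V}_1, \ldots, \mathbf{V}_{t-1}, \mathbf{W}_t). \tag{8.1.4}$$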


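Remark 4. From Remark 3 and the assumptions on the noise terms, it is clear that

$$E(\mathbf{V}_t\mathbf{X}_s') = 0, \quad E(\mathbf{V}_t\mathbf{Y}_s') = 0, \quad 1 \le s \le t,$$

and

$$E(\mathbf{W}_t\mathbf{X}_s') = 0, \quad 1 \le s \le t, \qquad E(\mathbf{W}_t\mathbf{Y}_s') = 0, \quad 1 \le s < t.$$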


Definition 8.1.1 A time series $\{\mathbf{Y}_t\}$ has a state-space representation if there exists a state-space
model for $\{\mathbf{Y}_t\}$ as specified by equations (8.1.1) and (8.1.2).


As already indicated, it is possible to find a state-space representation for a large
number of time-series (and other) models. It is clear also from the definition that
neither $\{\mathbf{X}_t\}$ nor $\{\mathbf{Y}_t\}$ is necessarily stationary. The beauty of a state-space
representation, when one can be found, lies in the simple structure of the state equation (8.1.2),
which permits relatively simple analysis of the process $\{\mathbf{X}_t\}$. The behavior of $\{\mathbf{Y}_t\}$ is
then easy to determine from that of $\{\mathbf{X}_t\}$ using the observation equation (8.1.1). If the
sequence $\{\mathbf{X}_1, \mathbf{V}_1, \mathbf{V}_2, \ldots\}$ is independent, then $\{\mathbf{X}_t\}$ has the Markov property; i.e.,
the distribution of $\mathbf{X}_{t+1}$ given $\mathbf{X}_t, \ldots, \mathbf{X}_1$ is the same as the distribution of $\mathbf{X}_{t+1}$ given
$\mathbf{X}_t$. This is a property possessed by many physical systems, provided that we include
sufficiently many components in the specification of the state $\mathbf{X}_t$ (for example, we
may choose the state vector in such a way that $\mathbf{X}_t$ includes components of $\mathbf{X}_{t-1}$ for
each $t$).


Example 8.1.1 An AR(1) process

Let $\{Y_t\}$ be the causal AR(1) process given by

$$Y_t = \phi Y_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{8.1.5}$$

In this case, a state-space representation for $\{Y_t\}$ is easy to construct. We can, for
example, define a sequence of state variables $X_t$ by

$$X_{t+1} = \phi X_t + V_t, \quad t = 1, 2, \ldots, \tag{8.1.6}$$

where $X_1 = Y_1 = \sum_{j=0}^{\infty} \phi^j Z_{1-j}$ and $V_t = Z_{t+1}$. The process $\{Y_t\}$ then satisfies the
observation equation

$$Y_t = X_t,$$

which has the form (8.1.1) with $G_t = 1$ and $W_t = 0$.


Example 8.1.2 An ARMA(1,1) process

Let $\{Y_t\}$ be the causal and invertible ARMA(1,1) process satisfying the equations

$$Y_t = \phi Y_{t-1} + Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{8.1.7}$$

Although the existence of a state-space representation for $\{Y_t\}$ is not obvious, we can
find one by observing that

$$Y_t = \theta(B)X_t = \begin{bmatrix} \theta & 1 \end{bmatrix} \begin{bmatrix} X_{t-1} \\ X_t \end{bmatrix}, \tag{8.1.8}$$

where $\{X_t\}$ is the causal AR(1) process satisfying

$$\phi(B)X_t = Z_t,$$

or the equivalent equation

$$\begin{bmatrix} X_t \\ X_{t+1} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & \phi \end{bmatrix} \begin{bmatrix} X_{t-1} \\ X_t \end{bmatrix} + \begin{bmatrix} 0 \\ Z_{t+1} \end{bmatrix}. \tag{8.1.9}$$

Noting that $X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}$, we see that equations (8.1.8) and (8.1.9) for $t =
1, 2, \ldots$ furnish a state-space representation of $\{Y_t\}$ with state vector
$\mathbf{X}_t = (X_{t-1}, X_t)'$, state noise $\mathbf{V}_t = (0, Z_{t+1})'$, and $W_t = 0$.

The extension of this state-space representation to general ARMA and ARIMA
processes is given in Section 8.3.


In subsequent sections we shall give examples that illustrate the versatility of
state-space models. (More examples can be found in Aoki, 1987, Hannan and Deistler,
1988, Harvey, 1990, and West and Harrison, 1989.) Before considering these, we need
a slight modification of (8.1.1) and (8.1.2), which allows for series in which the time
index runs from $-\infty$ to $\infty$. This is a more natural formulation for many time series
models.
State-Space Models with $t \in \{0, \pm 1, \ldots\}$


Consider the observation and state equations

$$\mathbf{Y}_t = G\mathbf{X}_t + \mathbf{W}_t, \quad t = 0, \pm 1, \ldots, \tag{8.1.10}$$

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t, \quad t = 0, \pm 1, \ldots, \tag{8.1.11}$$

where $F$ and $G$ are $v \times v$ and $w \times v$ matrices, respectively, $\{\mathbf{V}_t\} \sim \mathrm{WN}(\mathbf{0}, Q)$,
$\{\mathbf{W}_t\} \sim \mathrm{WN}(\mathbf{0}, R)$, and $E(\mathbf{V}_s\mathbf{W}_t') = 0$ for all $s$ and $t$.

The state equation (8.1.11) is said to be stable if the matrix $F$ has all its
eigenvalues in the interior of the unit circle, or equivalently if $\det(I - Fz) \neq 0$ for all $z$
complex such that $|z| \le 1$. The matrix $F$ is then also said to be stable.

In the stable case the equations (8.1.11) have the unique stationary solution
(Problem 8.1) given by

$$\mathbf{X}_t = \sum_{j=0}^{\infty} F^j\mathbf{V}_{t-j-1}.$$

The corresponding sequence of observations

$$\mathbf{Y}_t = \mathbf{W}_t + \sum_{j=0}^{\infty} GF^j\mathbf{V}_{t-j-1}$$

is also stationary.


8.2 The Basic Structural Model 


A structural time series model, like the classical decomposition model defined by
(1.5.1), is specified in terms of components such as trend, seasonality, and noise,
which are of direct interest in themselves. The deterministic nature of the trend
and seasonal components in the classical decomposition model, however, limits its
applicability. A natural way in which to overcome this deficiency is to permit random
variation in these components. This can be very conveniently done in the framework
of a state-space representation, and the resulting rather flexible model is called a
structural model. Estimation and forecasting with this model can be encompassed in
the general procedure for state-space models made possible by the Kalman recursions
of Section 8.4.


Example 8.2.1 The random walk plus noise model

One of the simplest structural models is obtained by adding noise to a random walk.
It is suggested by the nonseasonal classical decomposition model

$$Y_t = M_t + W_t, \quad \text{where } \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2), \tag{8.2.1}$$

and $M_t = m_t$, the deterministic "level" or "signal" at time $t$. We now introduce
randomness into the level by supposing that $M_t$ is a random walk satisfying

$$M_{t+1} = M_t + V_t, \quad \text{and } \{V_t\} \sim \mathrm{WN}(0, \sigma_v^2), \tag{8.2.2}$$

with initial value $M_1 = m_1$. Equations (8.2.1) and (8.2.2) constitute the "local level"
or "random walk plus noise" model. Figure 8.1 shows a realization of length 100 of
this model with $M_1 = 0$, $\sigma_v^2 = 4$, and $\sigma_w^2 = 8$. (The realized values $m_t$ of $M_t$ are
plotted as a solid line, and the observed data are plotted as square boxes.) The
differenced data

$$D_t := \nabla Y_t = Y_t - Y_{t-1} = V_{t-1} + W_t - W_{t-1}, \quad t \ge 2,$$

constitute a stationary time series with mean 0 and ACF

$$\rho_D(h) = \begin{cases} \dfrac{-\sigma_w^2}{2\sigma_w^2 + \sigma_v^2}, & \text{if } |h| = 1, \\ 0, & \text{if } |h| > 1. \end{cases}$$

Since $\{D_t\}$ is 1-correlated, we conclude from Proposition 2.1.1 that $\{D_t\}$ is an MA(1)
process and hence that $\{Y_t\}$ is an ARIMA(0,1,1) process. More specifically,

$$D_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{8.2.3}$$

where $\theta$ and $\sigma^2$ are found by solving the equations

$$\frac{\theta}{1 + \theta^2} = \frac{-\sigma_w^2}{2\sigma_w^2 + \sigma_v^2} \quad \text{and} \quad \theta\sigma^2 = -\sigma_w^2.$$

For the process $\{Y_t\}$ generating the data in Figure 8.1, the parameters $\theta$ and $\sigma^2$ of
the differenced series $\{D_t\}$ satisfy $\theta/(1 + \theta^2) = -.4$ and $\theta\sigma^2 = -8$. Solving these
equations for $\theta$ and $\sigma^2$, we find that $\theta = -.5$ and $\sigma^2 = 16$ (or $\theta = -2$ and $\sigma^2 = 4$).
The sample ACF of the observed differences $D_t$ of the realization of $\{Y_t\}$ in Figure
8.1 is shown in Figure 8.2.

Figure 8-1. Realization from a random walk plus noise model. The random walk is
represented by the solid line and the data are represented by boxes.
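
The two equations that determine $\theta$ and $\sigma^2$ reduce to a quadratic in $\theta$, so the numbers quoted for Figure 8.1 can be checked directly. The following Python sketch is an illustration written for this purpose (the function name is hypothetical, and it assumes $\sigma_w^2 > 0$); it returns the invertible solution, the one with $|\theta| \le 1$.

```python
import math

def ma1_parameters(sigma_v2, sigma_w2):
    """Return (theta, sigma2) for the MA(1) model (8.2.3) of the differenced
    local level series, choosing the root with |theta| <= 1.

    Assumes sigma_w2 > 0, so that the lag-one autocorrelation is nonzero.
    """
    rho1 = -sigma_w2 / (2.0 * sigma_w2 + sigma_v2)     # lag-one ACF of {D_t}
    # theta / (1 + theta^2) = rho1  <=>  rho1*theta^2 - theta + rho1 = 0
    disc = math.sqrt(1.0 - 4.0 * rho1 ** 2)
    theta = (1.0 - disc) / (2.0 * rho1)                # invertible root
    sigma2 = -sigma_w2 / theta                         # from theta*sigma2 = -sigma_w2
    return theta, sigma2

theta, sigma2 = ma1_parameters(sigma_v2=4.0, sigma_w2=8.0)
print(theta, sigma2)   # approximately -0.5 and 16, as in the text
```

Replacing `1.0 - disc` by `1.0 + disc` gives the noninvertible alternative, which for these variances is $\theta = -2$ with $\sigma^2 = 4$.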

The local level model is often used to represent a measured characteristic of the
output of an industrial process for which the unobserved process level $\{M_t\}$ is intended
to be within specified limits (to meet the design specifications of the manufactured
product). To decide whether or not the process requires corrective attention, it is
important to be able to test the hypothesis that the process level $\{M_t\}$ is constant.
From the state equation, we see that $\{M_t\}$ is constant (and equal to $m_1$) when $V_t = 0$
or equivalently when $\sigma_v^2 = 0$. This in turn is equivalent to the moving-average model
(8.2.3) for $\{D_t\}$ being noninvertible with $\theta = -1$ (see Problem 8.2). Tests of the unit
root hypothesis $\theta = -1$ were discussed in Section 6.3.2.

The local level model can easily be extended to incorporate a locally linear trend
with slope $\beta_t$ at time $t$. Equation (8.2.2) is replaced by

$$M_t = M_{t-1} + B_{t-1} + V_{t-1}, \tag{8.2.4}$$

where $B_{t-1} = \beta_{t-1}$. Now if we introduce randomness into the slope by replacing it
with the random walk

$$B_t = B_{t-1} + U_{t-1}, \quad \text{where } \{U_t\} \sim \mathrm{WN}(0, \sigma_u^2), \tag{8.2.5}$$

we obtain the "local linear trend" model.

Figure 8-2. Sample ACF of the series obtained by differencing the data in Figure 8.1.
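To express the local linear trend model in state-space form we introduce the state
vector

$$\mathbf{X}_t = (M_t, B_t)'.$$

Then (8.2.4) and (8.2.5) can be written in the equivalent form

$$\mathbf{X}_{t+1} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \mathbf{X}_t + \mathbf{V}_t, \quad t = 1, 2, \ldots, \tag{8.2.6}$$

where $\mathbf{V}_t = (V_t, U_t)'$. The process $\{Y_t\}$ is then determined by the observation equation

$$Y_t = \begin{bmatrix} 1 & 0 \end{bmatrix} \mathbf{X}_t + W_t. \tag{8.2.7}$$

If $\{\mathbf{X}_1, U_1, V_1, W_1, U_2, V_2, W_2, \ldots\}$ is an uncorrelated sequence, then equations (8.2.6)
and (8.2.7) constitute a state-space representation of the process $\{Y_t\}$, which is a
model for data with randomly varying trend and added noise. For this model we have
$v = 2$, $w = 1$,

$$F = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad G = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad Q = \begin{bmatrix} \sigma_v^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix}, \quad \text{and} \quad R = \sigma_w^2.$$
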

Example 8.2.2 A seasonal series with noise

The classical decomposition (1.5.11) expressed the time series $\{X_t\}$ as a sum of
trend, seasonal, and noise components. The seasonal component (with period $d$) was
a sequence $\{s_t\}$ with the properties $s_{t+d} = s_t$ and $\sum_{t=1}^{d} s_t = 0$. Such a sequence can
be generated, for any values of $s_1, s_0, \ldots, s_{-d+3}$, by means of the recursions

$$s_{t+1} = -s_t - \cdots - s_{t-d+2}, \quad t = 1, 2, \ldots. \tag{8.2.8}$$

A somewhat more general seasonal component $\{Y_t\}$, allowing for random deviations
from strict periodicity, is obtained by adding a term $S_t$ to the right side of (8.2.8),
where $\{S_t\}$ is white noise with mean zero. This leads to the recursion relations

$$Y_{t+1} = -Y_t - \cdots - Y_{t-d+2} + S_t, \quad t = 1, 2, \ldots. \tag{8.2.9}$$

To find a state-space representation for $\{Y_t\}$ we introduce the $(d - 1)$-dimensional
state vector

$$\mathbf{X}_t = (Y_t, Y_{t-1}, \ldots, Y_{t-d+2})'.$$

The series $\{Y_t\}$ is then given by the observation equation

$$Y_t = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \end{bmatrix} \mathbf{X}_t, \quad t = 1, 2, \ldots, \tag{8.2.10}$$

where $\{\mathbf{X}_t\}$ satisfies the state equation

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t, \quad t = 1, 2, \ldots, \tag{8.2.11}$$

$\mathbf{V}_t = (S_t, 0, \ldots, 0)'$, and

$$F = \begin{bmatrix} -1 & -1 & \cdots & -1 & -1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}.$$

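
The claim that the recursions (8.2.8) generate a sequence with period $d$ and zero sum over each period can be checked numerically. The sketch below is illustrative only: the function name, the period $d = 4$, and the starting values are choices made here rather than taken from the text.

```python
def seasonal_sequence(initial, n):
    """Extend a seasonal sequence with the recursion (8.2.8).

    `initial` holds the d - 1 starting values s_{-d+3}, ..., s_0, s_1
    (oldest first); the function returns the next n values s_2, s_3, ...
    """
    window = list(initial)            # the d - 1 most recent values
    later = []
    for _ in range(n):
        nxt = -sum(window)            # s_{t+1} = -s_t - ... - s_{t-d+2}
        later.append(nxt)
        window = window[1:] + [nxt]   # slide the window forward one step
    return later

d = 4
initial = [3, -1, 2]                  # hypothetical s_{-1}, s_0, s_1
later = seasonal_sequence(initial, 9)
full = initial + later

print(full)                           # repeats with period d = 4
print(sum(full[:d]))                  # zero sum over one period
```

The state equation (8.2.11) with $\mathbf{V}_t = \mathbf{0}$ performs the same update: the first row of $F$ forms the negative sum and the remaining rows shift the window.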
Example 8.2.3 A randomly varying trend with random seasonality and noise

A series with randomly varying trend, random seasonality and noise can be
constructed by adding the two series in Examples 8.2.1 and 8.2.2. (Addition of series
with state-space representations is in fact always possible by means of the following
construction. See Problem 8.9.) We introduce the state vector

$$\mathbf{X}_t = \begin{bmatrix} \mathbf{X}_t^1 \\ \mathbf{X}_t^2 \end{bmatrix},$$

where $\mathbf{X}_t^1$ and $\mathbf{X}_t^2$ are the state vectors in (8.2.6) and (8.2.11). We then have the
following representation for $\{Y_t\}$, the sum of the two series whose state-space representations
were given in (8.2.6)-(8.2.7) and (8.2.10)-(8.2.11). The state equation is

$$\mathbf{X}_{t+1} = \begin{bmatrix} F_1 & 0 \\ 0 & F_2 \end{bmatrix} \mathbf{X}_t + \begin{bmatrix} \mathbf{V}_t^1 \\ \mathbf{V}_t^2 \end{bmatrix}, \tag{8.2.12}$$

where $F_1$, $F_2$ are the coefficient matrices and $\{\mathbf{V}_t^1\}$, $\{\mathbf{V}_t^2\}$ are the noise vectors in the
state equations (8.2.6) and (8.2.11), respectively. The observation equation is

$$Y_t = \begin{bmatrix} 1 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix} \mathbf{X}_t + W_t, \tag{8.2.13}$$

where $\{W_t\}$ is the noise sequence in (8.2.7). If the sequence of random vectors
$\{\mathbf{X}_1, \mathbf{V}_1^1, \mathbf{V}_1^2, W_1, \mathbf{V}_2^1, \mathbf{V}_2^2, W_2, \ldots\}$ is uncorrelated, then equations (8.2.12) and (8.2.13)
constitute a state-space representation for $\{Y_t\}$.


8.3 State-Space Representation of ARIMA Models 


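We begin by establishing a state-space representation for the causal AR($p$) process
and then build on this example to find representations for the general ARMA and
ARIMA processes.


Example 8.3.1 State-space representation of a causal AR($p$) process

Consider the AR($p$) process defined by

$$Y_{t+1} = \phi_1 Y_t + \phi_2 Y_{t-1} + \cdots + \phi_p Y_{t-p+1} + Z_{t+1}, \quad t = 0, \pm 1, \ldots, \tag{8.3.1}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, and $\phi(z) := 1 - \phi_1 z - \cdots - \phi_p z^p$ is nonzero for $|z| \le 1$.
To express $\{Y_t\}$ in state-space form we simply introduce the state vectors

$$\mathbf{X}_t = \begin{bmatrix} Y_{t-p+1} \\ Y_{t-p+2} \\ \vdots \\ Y_t \end{bmatrix}, \quad t = 0, \pm 1, \ldots. \tag{8.3.2}$$

From (8.3.1) and (8.3.2) the observation equation is

$$Y_t = \begin{bmatrix} 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \mathbf{X}_t, \quad t = 0, \pm 1, \ldots, \tag{8.3.3}$$

while the state equation is given by

$$\mathbf{X}_{t+1} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ \phi_p & \phi_{p-1} & \phi_{p-2} & \cdots & \phi_1 \end{bmatrix} \mathbf{X}_t + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} Z_{t+1}, \quad t = 0, \pm 1, \ldots. \tag{8.3.4}$$

These equations have the required forms (8.1.10) and (8.1.11) with $W_t = 0$ and
$\mathbf{V}_t = (0, 0, \ldots, Z_{t+1})'$, $t = 0, \pm 1, \ldots$.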


Remark 1. In Example 8.3.1 the causality condition $\phi(z) \neq 0$ for $|z| \le 1$ is
equivalent to the condition that the state equation (8.3.4) is stable, since the eigenvalues
of the coefficient matrix in (8.3.4) are simply the reciprocals of the zeros of $\phi(z)$
(Problem 8.3).
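
For $p = 2$ the relation in Remark 1 can be verified by hand or with a few lines of code, since both the eigenvalues of the coefficient matrix and the zeros of $\phi(z)$ solve quadratic equations. The following Python sketch is only an illustration; the coefficient values and the helper name are assumptions chosen here.

```python
import cmath

def quadratic_roots(a, b, c):
    """Both roots of a*x**2 + b*x + c = 0 (a != 0), possibly complex."""
    disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

phi1, phi2 = 0.5, 0.3     # illustrative AR(2) coefficients (an assumption)

# Eigenvalues of F = [[0, 1], [phi2, phi1]] solve lam^2 - phi1*lam - phi2 = 0.
eigenvalues = quadratic_roots(1.0, -phi1, -phi2)

# Zeros of phi(z) = 1 - phi1*z - phi2*z^2.
zeros = quadratic_roots(-phi2, -phi1, 1.0)

stable = all(abs(lam) < 1 for lam in eigenvalues)   # state equation stable
causal = all(abs(z) > 1 for z in zeros)             # phi(z) != 0 for |z| <= 1

print("eigenvalues:", eigenvalues)
print("reciprocals of zeros:", tuple(1 / z for z in zeros))
print("stable:", stable, "causal:", causal)
```

The two printed tuples contain the same numbers (possibly in a different order), so stability of the state equation and causality of the AR(2) process hold or fail together.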


Remark 2. If equations (8.3.3) and (8.3.4) are postulated to hold only for $t =
1, 2, \ldots$, and if $\mathbf{X}_1$ is a random vector such that $\{\mathbf{X}_1, Z_1, Z_2, \ldots\}$ is an uncorrelated
sequence, then we have a state-space representation for $\{Y_t\}$ of the type defined
earlier by (8.1.1) and (8.1.2). The resulting process $\{Y_t\}$ is well-defined, regardless
of whether or not the state equation is stable, but it will not in general be stationary.
It will be stationary if the state equation is stable and if $\mathbf{X}_1$ is defined by (8.3.2) with
$Y_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$, $t = 1, 0, \ldots, 2 - p$, and $\psi(z) = 1/\phi(z)$, $|z| \le 1$.


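Example 8.3.2 State-space form of a causal ARMA($p$, $q$) process

State-space representations are not unique. Here we shall give one of the (infinitely
many) possible representations of a causal ARMA($p$, $q$) process that can easily be
derived from Example 8.3.1. Consider the ARMA($p$, $q$) process defined by

$$\phi(B)Y_t = \theta(B)Z_t, \quad t = 0, \pm 1, \ldots, \tag{8.3.5}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\phi(z) \neq 0$ for $|z| \le 1$. Let

$$r = \max(p, q + 1), \quad \phi_j = 0 \text{ for } j > p, \quad \theta_j = 0 \text{ for } j > q, \quad \text{and} \quad \theta_0 = 1.$$

If $\{U_t\}$ is the causal AR($p$) process satisfying

$$\phi(B)U_t = Z_t, \tag{8.3.6}$$

then $Y_t = \theta(B)U_t$, since

$$\phi(B)Y_t = \phi(B)\theta(B)U_t = \theta(B)\phi(B)U_t = \theta(B)Z_t.$$

Consequently,

$$Y_t = \begin{bmatrix} \theta_{r-1} & \theta_{r-2} & \cdots & \theta_0 \end{bmatrix} \mathbf{X}_t, \tag{8.3.7}$$

where

$$\mathbf{X}_t = \begin{bmatrix} U_{t-r+1} \\ U_{t-r+2} \\ \vdots \\ U_t \end{bmatrix} \tag{8.3.8}$$

and, as in Example 8.3.1,

$$\mathbf{X}_{t+1} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ \phi_r & \phi_{r-1} & \phi_{r-2} & \cdots & \phi_1 \end{bmatrix} \mathbf{X}_t + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} Z_{t+1}, \quad t = 0, \pm 1, \ldots. \tag{8.3.9}$$

Equations (8.3.7) and (8.3.9) are the required observation and state equations. As
in Example 8.3.1, the observation and state noise vectors are again $W_t = 0$ and
$\mathbf{V}_t = (0, 0, \ldots, Z_{t+1})'$, $t = 0, \pm 1, \ldots$.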
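
The construction in (8.3.7) and (8.3.9) is mechanical enough to code directly. The Python sketch below is a minimal illustration, not a reference implementation: the function names are hypothetical, and for simplicity it starts both recursions from zero presample values instead of the stationary initial state. Under that assumption the state-space form and the defining ARMA recursion should produce identical output for the same noise sequence.

```python
import random

def arma_state_space(phi, theta):
    """Return (F, G) as in (8.3.9) and (8.3.7) for AR coefficients
    phi = [phi_1, ..., phi_p] and MA coefficients theta = [theta_1, ..., theta_q]."""
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_r = list(phi) + [0.0] * (r - p)                    # phi_j = 0 for j > p
    theta_r = [1.0] + list(theta) + [0.0] * (r - 1 - q)    # theta_0 = 1, theta_j = 0 for j > q
    F = [[1.0 if j == i + 1 else 0.0 for j in range(r)] for i in range(r - 1)]
    F.append(phi_r[::-1])                                  # last row: phi_r, ..., phi_1
    G = theta_r[::-1]                                      # theta_{r-1}, ..., theta_0
    return F, G

def simulate_state_space(F, G, z):
    """Run X_{t+1} = F X_t + (0, ..., 0, Z_{t+1})' from a zero state; Y_t = G X_t."""
    r = len(G)
    x = [0.0] * r
    y = []
    for zt in z:
        x = [sum(F[i][j] * x[j] for j in range(r)) for i in range(r)]
        x[-1] += zt
        y.append(sum(g * xi for g, xi in zip(G, x)))
    return y

def simulate_arma(phi, theta, z):
    """Direct recursion phi(B) Y_t = theta(B) Z_t with zero presample values."""
    y = []
    for t, zt in enumerate(z):
        ar = sum(phi[j] * y[t - 1 - j] for j in range(len(phi)) if t - 1 - j >= 0)
        ma = sum(theta[j] * z[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
        y.append(ar + zt + ma)
    return y

rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(200)]
phi, theta = [0.5], [0.4, 0.3]                             # illustrative ARMA(1,2)
F, G = arma_state_space(phi, theta)
gap = max(abs(a - b) for a, b in
          zip(simulate_state_space(F, G, z), simulate_arma(phi, theta, z)))
print(F, G, gap)
```

Because $q + 1 > p$ in this illustration, $r = 3$ and the AR coefficients are padded with zeros, which is exactly the role of the convention $\phi_j = 0$ for $j > p$.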


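Example 8.3.3 State-space representation of an ARIMA($p$, $d$, $q$) process

If $\{Y_t\}$ is an ARIMA($p$, $d$, $q$) process with $\{\nabla^d Y_t\}$ satisfying (8.3.5), then by the
preceding example $\{\nabla^d Y_t\}$ has the representation

$$\nabla^d Y_t = G\mathbf{X}_t, \quad t = 0, \pm 1, \ldots, \tag{8.3.10}$$

where $\{\mathbf{X}_t\}$ is the unique stationary solution of the state equation

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t,$$

$F$ and $G$ are the coefficients of $\mathbf{X}_t$ in (8.3.9) and (8.3.7), respectively, and $\mathbf{V}_t =
(0, 0, \ldots, Z_{t+1})'$. Let $A$ and $B$ be the $d \times 1$ and $d \times d$ matrices defined by $A = B = 1$
if $d = 1$ and

$$A = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ (-1)^{d+1}\binom{d}{d} & (-1)^{d}\binom{d}{d-1} & (-1)^{d-1}\binom{d}{d-2} & \cdots & d \end{bmatrix}$$

if $d > 1$. Then since

$$Y_t = \nabla^d Y_t - \sum_{j=1}^{d} \binom{d}{j}(-1)^j Y_{t-j}, \tag{8.3.11}$$

the vector

$$\mathbf{Y}_{t-1} := (Y_{t-d}, \ldots, Y_{t-1})'$$

satisfies the equation

$$\mathbf{Y}_t = A\nabla^d Y_t + B\mathbf{Y}_{t-1} = AG\mathbf{X}_t + B\mathbf{Y}_{t-1}.$$

Defining a new state vector $\mathbf{T}_t$ by stacking $\mathbf{X}_t$ and $\mathbf{Y}_{t-1}$, we therefore obtain the state
equation

$$\mathbf{T}_{t+1} := \begin{bmatrix} \mathbf{X}_{t+1} \\ \mathbf{Y}_t \end{bmatrix} = \begin{bmatrix} F & 0 \\ AG & B \end{bmatrix} \mathbf{T}_t + \begin{bmatrix} \mathbf{V}_t \\ \mathbf{0} \end{bmatrix}, \quad t = 1, 2, \ldots, \tag{8.3.12}$$

and the observation equation, from (8.3.10) and (8.3.11),

$$Y_t = \begin{bmatrix} G & (-1)^{d+1}\binom{d}{d} & (-1)^{d}\binom{d}{d-1} & (-1)^{d-1}\binom{d}{d-2} & \cdots & d \end{bmatrix} \begin{bmatrix} \mathbf{X}_t \\ \mathbf{Y}_{t-1} \end{bmatrix}, \quad t = 1, 2, \ldots, \tag{8.3.13}$$

with initial condition

$$\mathbf{T}_1 = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{Y}_0 \end{bmatrix} = \begin{bmatrix} \sum_{j=0}^{\infty} F^j\mathbf{V}_{-j} \\ \mathbf{Y}_0 \end{bmatrix}, \tag{8.3.14}$$

and the assumption

$$E(\mathbf{Y}_0 Z_t') = 0, \quad t = 0, \pm 1, \ldots, \tag{8.3.15}$$

where $\mathbf{Y}_0 = (Y_{1-d}, Y_{2-d}, \ldots, Y_0)'$. The conditions (8.3.15), which are satisfied in
particular if $\mathbf{Y}_0$ is considered to be nonrandom and equal to the vector of observed
values $(y_{1-d}, y_{2-d}, \ldots, y_0)'$, are imposed to ensure that the assumptions of a
state-space model given in Section 8.1 are satisfied. They also imply that $E(\mathbf{X}_1\mathbf{Y}_0') = 0$ and
$E(\mathbf{Y}_0 \nabla^d Y_t') = 0$, $t \ge 1$, as required earlier in Section 6.4 for prediction of ARIMA
processes.

State-space models for more general ARIMA processes (e.g., $\{Y_t\}$ such that
$\{\nabla\nabla_{12} Y_t\}$ is an ARMA($p$, $q$) process) can be constructed in the same way. See
Problem 8.4.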


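For the ARIMA(1, 1, 1) process defined by

$$(1 - \phi B)(1 - B)Y_t = (1 + \theta B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

the vectors $\mathbf{X}_t$ and $\mathbf{Y}_{t-1}$ reduce to $\mathbf{X}_t = (X_{t-1}, X_t)'$ and $\mathbf{Y}_{t-1} = Y_{t-1}$. From (8.3.12)
and (8.3.13) the state-space representation is therefore (Problem 8.8)

$$Y_t = \begin{bmatrix} \theta & 1 & 1 \end{bmatrix} \begin{bmatrix} X_{t-1} \\ X_t \\ Y_{t-1} \end{bmatrix}, \tag{8.3.16}$$

where

$$\begin{bmatrix} X_t \\ X_{t+1} \\ Y_t \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & \phi & 0 \\ \theta & 1 & 1 \end{bmatrix} \begin{bmatrix} X_{t-1} \\ X_t \\ Y_{t-1} \end{bmatrix} + \begin{bmatrix} 0 \\ Z_{t+1} \\ 0 \end{bmatrix}, \quad t = 1, 2, \ldots, \tag{8.3.17}$$

and

$$\begin{bmatrix} X_0 \\ X_1 \\ Y_0 \end{bmatrix} = \begin{bmatrix} \sum_{j=0}^{\infty} \phi^j Z_{-j} \\ \sum_{j=0}^{\infty} \phi^j Z_{1-j} \\ Y_0 \end{bmatrix}. \tag{8.3.18}$$
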

8.4 The Kalman Recursions 


In this section we shall consider three fundamental problems associated with the
state-space model defined by (8.1.1) and (8.1.2) in Section 8.1. These are all concerned
with finding best (in the sense of minimum mean square error) linear estimates of
the state-vector $\mathbf{X}_t$ in terms of the observations $\mathbf{Y}_1, \mathbf{Y}_2, \ldots$, and a random vector $\mathbf{Y}_0$
that is orthogonal to $\mathbf{V}_t$ and $\mathbf{W}_t$ for all $t \ge 1$. In many cases $\mathbf{Y}_0$ will be the constant
vector $(1, 1, \ldots, 1)'$. Estimation of $\mathbf{X}_t$ in terms of:


a. $\mathbf{Y}_0, \ldots, \mathbf{Y}_{t-1}$ defines the prediction problem,
b. $\mathbf{Y}_0, \ldots, \mathbf{Y}_t$ defines the filtering problem,
c. $\mathbf{Y}_0, \ldots, \mathbf{Y}_n$ ($n > t$) defines the smoothing problem.


Each of these problems can be solved recursively using an appropriate set of Kalman
recursions, which will be established in this section.

In the following definition of best linear predictor (and throughout this chapter)
it should be noted that we do not automatically include the constant 1 among the
predictor variables as we did in Sections 2.5 and 7.5. (It can, however, be included
by choosing $\mathbf{Y}_0 = (1, 1, \ldots, 1)'$.)
Definition 8.4.1

For the random vector $\mathbf{X} = (X_1, \ldots, X_v)'$,

$$P_t(\mathbf{X}) := (P_t(X_1), \ldots, P_t(X_v))',$$

where $P_t(X_i) := P(X_i \mid \mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_t)$ is the best linear predictor of $X_i$ in terms
of all components of $\mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_t$.


Remark 1. By the definition of the best predictor of each component $X_i$ of $\mathbf{X}$,
$P_t(\mathbf{X})$ is the unique random vector of the form

$$P_t(\mathbf{X}) = A_0\mathbf{Y}_0 + \cdots + A_t\mathbf{Y}_t$$

with $v \times w$ matrices $A_0, \ldots, A_t$ such that

$$[\mathbf{X} - P_t(\mathbf{X})] \perp \mathbf{Y}_s, \quad s = 0, \ldots, t$$

(cf. (7.5.2) and (7.5.3)). Recall that two random vectors $\mathbf{X}$ and $\mathbf{Y}$ are orthogonal
(written $\mathbf{X} \perp \mathbf{Y}$) if $E(\mathbf{X}\mathbf{Y}')$ is a matrix of zeros.


Remark 2. If all the components of $\mathbf{X}, \mathbf{Y}_1, \ldots, \mathbf{Y}_t$ are jointly normally distributed
and $\mathbf{Y}_0 = (1, \ldots, 1)'$, then

$$P_t(\mathbf{X}) = E(\mathbf{X} \mid \mathbf{Y}_1, \ldots, \mathbf{Y}_t), \quad t \ge 1.$$


Remark 3. $P_t$ is linear in the sense that if $A$ is any $k \times v$ matrix and $\mathbf{X}$, $\mathbf{V}$ are two
$v$-variate random vectors with finite second moments, then (Problem 8.10)

$$P_t(A\mathbf{X}) = AP_t(\mathbf{X})$$

and

$$P_t(\mathbf{X} + \mathbf{V}) = P_t(\mathbf{X}) + P_t(\mathbf{V}).$$


Remark 4. If $\mathbf{X}$ and $\mathbf{Y}$ are random vectors with $v$ and $w$ components, respectively,
each with finite second moments, then

$$P(\mathbf{X} \mid \mathbf{Y}) = M\mathbf{Y},$$

where $M$ is a $v \times w$ matrix, $M = E(\mathbf{X}\mathbf{Y}')[E(\mathbf{Y}\mathbf{Y}')]^{-1}$ with $[E(\mathbf{Y}\mathbf{Y}')]^{-1}$ any generalized
inverse of $E(\mathbf{Y}\mathbf{Y}')$. (A generalized inverse of a matrix $S$ is a matrix $S^{-1}$ such that
$SS^{-1}S = S$. Every matrix has at least one. See Problem 8.11.)

In the notation just developed, the prediction, filtering, and smoothing problems
(a), (b), and (c) formulated above reduce to the determination of $P_{t-1}(\mathbf{X}_t)$, $P_t(\mathbf{X}_t)$,
and $P_n(\mathbf{X}_t)$ ($n > t$), respectively. We deal first with the prediction problem.
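Kalman Prediction:
For the state-space model (8.1.1)-(8.1.2), the one-step predictors $\hat{\mathbf{X}}_t := P_{t-1}(\mathbf{X}_t)$
and their error covariance matrices $\Omega_t = E[(\mathbf{X}_t - \hat{\mathbf{X}}_t)(\mathbf{X}_t - \hat{\mathbf{X}}_t)']$ are uniquely
determined by the initial conditions

$$\hat{\mathbf{X}}_1 = P(\mathbf{X}_1 \mid \mathbf{Y}_0), \quad \Omega_1 = E[(\mathbf{X}_1 - \hat{\mathbf{X}}_1)(\mathbf{X}_1 - \hat{\mathbf{X}}_1)']$$

and the recursions, for $t = 1, \ldots$,

$$\hat{\mathbf{X}}_{t+1} = F_t\hat{\mathbf{X}}_t + \Theta_t\Delta_t^{-1}(\mathbf{Y}_t - G_t\hat{\mathbf{X}}_t), \tag{8.4.1}$$

$$\Omega_{t+1} = F_t\Omega_tF_t' + Q_t - \Theta_t\Delta_t^{-1}\Theta_t', \tag{8.4.2}$$

where

$$\Delta_t = G_t\Omega_tG_t' + R_t,$$

$$\Theta_t = F_t\Omega_tG_t',$$

and $\Delta_t^{-1}$ is any generalized inverse of $\Delta_t$.

Proof. We shall make use of the innovations $\mathbf{I}_t$ defined by $\mathbf{I}_0 = \mathbf{Y}_0$ and

$$\mathbf{I}_t = \mathbf{Y}_t - P_{t-1}\mathbf{Y}_t = \mathbf{Y}_t - G_t\hat{\mathbf{X}}_t = G_t(\mathbf{X}_t - \hat{\mathbf{X}}_t) + \mathbf{W}_t, \quad t = 1, 2, \ldots.$$

The sequence $\{\mathbf{I}_t\}$ is orthogonal by Remark 1. Using Remarks 3 and 4 and the relation

$$P_t(\cdot) = P_{t-1}(\cdot) + P(\cdot \mid \mathbf{I}_t) \tag{8.4.3}$$

(see Problem 8.12), we find that

$$\begin{aligned}
\hat{\mathbf{X}}_{t+1} &= P_{t-1}(\mathbf{X}_{t+1}) + P(\mathbf{X}_{t+1} \mid \mathbf{I}_t) = P_{t-1}(F_t\mathbf{X}_t + \mathbf{V}_t) + \Theta_t\Delta_t^{-1}\mathbf{I}_t \\
&= F_t\hat{\mathbf{X}}_t + \Theta_t\Delta_t^{-1}\mathbf{I}_t,
\end{aligned} \tag{8.4.4}$$

where

$$\Delta_t = E(\mathbf{I}_t\mathbf{I}_t') = G_t\Omega_tG_t' + R_t,$$

$$\Theta_t = E(\mathbf{X}_{t+1}\mathbf{I}_t') = E\left[(F_t\mathbf{X}_t + \mathbf{V}_t)\left([\mathbf{X}_t - \hat{\mathbf{X}}_t]'G_t' + \mathbf{W}_t'\right)\right] = F_t\Omega_tG_t'.$$

To verify (8.4.2), we observe from the definition of $\Omega_{t+1}$ that

$$\Omega_{t+1} = E(\mathbf{X}_{t+1}\mathbf{X}_{t+1}') - E(\hat{\mathbf{X}}_{t+1}\hat{\mathbf{X}}_{t+1}').$$

With (8.1.2) and (8.4.4) this gives

$$\begin{aligned}
\Omega_{t+1} &= F_tE(\mathbf{X}_t\mathbf{X}_t')F_t' + Q_t - F_tE(\hat{\mathbf{X}}_t\hat{\mathbf{X}}_t')F_t' - \Theta_t\Delta_t^{-1}\Theta_t' \\
&= F_t\Omega_tF_t' + Q_t - \Theta_t\Delta_t^{-1}\Theta_t'. \qquad \blacksquare
\end{aligned}$$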
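
The recursions (8.4.1) and (8.4.2) translate almost line by line into code. The following Python sketch is a simplified illustration under stated assumptions, not a general implementation: it handles only time-invariant models with a scalar observation ($w = 1$) and $R > 0$, so that $\Delta_t$ is a positive number and no generalized inverse is needed. The function name and the data values are hypothetical.

```python
def kalman_predict(F, G, Q, R, x_hat, omega, observations):
    """Apply the prediction recursions (8.4.1)-(8.4.2) to each observation.

    Assumptions of this sketch: F and Q are v x v lists of lists, G is the
    single row of the observation matrix (a list of length v), and R > 0 is
    the scalar observation noise variance. x_hat and omega are the initial
    predictor of X_1 and its error covariance matrix. Returns the predictor
    of the state after the last observation and its error covariance.
    """
    v = len(F)
    for y in observations:
        omega_g = [sum(omega[i][k] * G[k] for k in range(v)) for i in range(v)]
        delta = sum(G[i] * omega_g[i] for i in range(v)) + R            # Delta_t
        theta = [sum(F[i][k] * omega_g[k] for k in range(v))
                 for i in range(v)]                                     # Theta_t
        innovation = y - sum(G[i] * x_hat[i] for i in range(v))         # Y_t - G X_hat_t
        f_x = [sum(F[i][k] * x_hat[k] for k in range(v)) for i in range(v)]
        f_omega = [[sum(F[i][k] * omega[k][j] for k in range(v))
                    for j in range(v)] for i in range(v)]
        x_hat = [f_x[i] + theta[i] * innovation / delta
                 for i in range(v)]                                     # (8.4.1)
        omega = [[sum(f_omega[i][k] * F[j][k] for k in range(v)) + Q[i][j]
                  - theta[i] * theta[j] / delta for j in range(v)]
                 for i in range(v)]                                     # (8.4.2)
    return x_hat, omega

# Local level model of Example 8.2.1 (v = 1) with sigma_v^2 = 4, sigma_w^2 = 8.
data = [float(t % 7) for t in range(200)]        # arbitrary illustrative data
x_next, omega_next = kalman_predict([[1.0]], [1.0], [[4.0]], 8.0,
                                    [0.0], [[100.0]], data)
print(x_next, omega_next)
```

The same function applies to the local linear trend model by passing $F$, $G$, $Q$, and $R$ from (8.2.6) and (8.2.7). Note that the error covariance recursion does not involve the data, so `omega_next` depends only on the model matrices and the starting value.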


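h-step Prediction of $\{\mathbf{Y}_t\}$ Using the Kalman Recursions


The Kalman prediction equations lead to a very simple algorithm for recursive
calculation of the best linear mean square predictors $P_t\mathbf{Y}_{t+h}$, $h = 1, 2, \ldots$. From (8.4.4),
(8.1.1), (8.1.2), and Remark 3 in Section 8.1, we find that

$$P_t\mathbf{X}_{t+1} = F_tP_{t-1}\mathbf{X}_t + \Theta_t\Delta_t^{-1}(\mathbf{Y}_t - P_{t-1}\mathbf{Y}_t), \tag{8.4.5}$$

$$\begin{aligned}
P_t\mathbf{X}_{t+h} &= F_{t+h-1}P_t\mathbf{X}_{t+h-1} \\
&\;\;\vdots \\
&= (F_{t+h-1}F_{t+h-2} \cdots F_{t+1})P_t\mathbf{X}_{t+1}, \quad h = 2, 3, \ldots,
\end{aligned} \tag{8.4.6}$$

and

$$P_t\mathbf{Y}_{t+h} = G_{t+h}P_t\mathbf{X}_{t+h}, \quad h = 1, 2, \ldots. \tag{8.4.7}$$

From the relation

$$\mathbf{X}_{t+h} - P_t\mathbf{X}_{t+h} = F_{t+h-1}(\mathbf{X}_{t+h-1} - P_t\mathbf{X}_{t+h-1}) + \mathbf{V}_{t+h-1}, \quad h = 2, 3, \ldots,$$

we find that $\Omega_t^{(h)} := E[(\mathbf{X}_{t+h} - P_t\mathbf{X}_{t+h})(\mathbf{X}_{t+h} - P_t\mathbf{X}_{t+h})']$ satisfies the recursions

$$\Omega_t^{(h)} = F_{t+h-1}\Omega_t^{(h-1)}F_{t+h-1}' + Q_{t+h-1}, \quad h = 2, 3, \ldots, \tag{8.4.8}$$

with $\Omega_t^{(1)} = \Omega_{t+1}$. Then from (8.1.1) and (8.4.7), $\Delta_t^{(h)} := E[(\mathbf{Y}_{t+h} - P_t\mathbf{Y}_{t+h})(\mathbf{Y}_{t+h} -
P_t\mathbf{Y}_{t+h})']$ is given by

$$\Delta_t^{(h)} = G_{t+h}\Omega_t^{(h)}G_{t+h}' + R_{t+h}, \quad h = 1, 2, \ldots. \tag{8.4.9}$$
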


Consider the random walk plus noise model of Example 8.2.1 defined by 
Y, =X, +W, {W} ~ WN (0, o$). 

where the local level X, follows the random walk 
Xai =X, +V, {V} ~ WN (0, o2). 


Applying the Kalman prediction equations with Yo := 1, R = 02, and Q = oĉ, we 
obtain 


3 . ©, z 
Pn = RY =E (¥, - 2) 
t 


= (1 —a,)Y, + ay; 
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where 
©, Q 
g E = 


A Q,+02° 


For a state-space model (like this one) with time-independent parameters, the solution of the Kalman recursions (8.4.2) is called a steady-state solution if $\Omega_t$ is independent of $t$. If $\Omega_t = \Omega$ for all $t$, then from (8.4.2)

$$\Omega_{t+1} = \Omega = \Omega + \sigma_v^2 - \frac{\Omega^2}{\Omega + \sigma_w^2} = \frac{\Omega\sigma_w^2}{\Omega + \sigma_w^2} + \sigma_v^2.$$

Solving this quadratic equation for $\Omega$ and noting that $\Omega \ge 0$, we find that

$$\Omega = \frac{1}{2}\left(\sigma_v^2 + \sqrt{\sigma_v^4 + 4\sigma_v^2\sigma_w^2}\right).$$

Since $\Omega_{t+1} - \Omega_t$ is a continuous function of $\Omega_t$ on $\Omega_t \ge 0$, positive at $\Omega_t = 0$, negative for large $\Omega_t$, and zero only at $\Omega_t = \Omega$, it is clear that $\Omega_{t+1} - \Omega_t$ is negative for $\Omega_t > \Omega$ and positive for $\Omega_t < \Omega$. A similar argument shows (Problem 8.14) that $(\Omega_{t+1} - \Omega)(\Omega_t - \Omega) \ge 0$ for all $\Omega_t \ge 0$. These observations imply that $\Omega_{t+1}$ always falls between $\Omega$ and $\Omega_t$. Consequently, regardless of the value of $\Omega_1$, $\Omega_t$ converges to $\Omega$, the unique solution of $\Omega_{t+1} = \Omega_t$. For any initial predictors $\hat Y_1 = \hat X_1$ and any initial mean squared error $\Omega_1 = E(X_1 - \hat X_1)^2$, the coefficients $a_t := \Omega_t/(\Omega_t + \sigma_w^2)$ converge to

$$a = \frac{\Omega}{\Omega + \sigma_w^2},$$

and the mean squared errors of the predictors defined by

$$\hat Y_{t+1} = (1 - a_t)\hat Y_t + a_tY_t$$

converge to $\Omega + \sigma_w^2$.

If, as is often the case, we do not know $\Omega_1$, then we cannot determine the sequence $\{a_t\}$. It is natural, therefore, to consider the behavior of the predictors defined by

$$\hat Y_{t+1} = (1 - a)\hat Y_t + aY_t$$

with $a$ as above and arbitrary $\hat Y_1$. It can be shown (Problem 8.16) that this sequence of predictors is also asymptotically optimal in the sense that the mean squared error converges to $\Omega + \sigma_w^2$ as $t \to \infty$.
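The convergence of $\Omega_t$ to the steady-state value can be checked numerically. The sketch below iterates the scalar form of (8.4.2) for the random walk plus noise model and compares the result with the closed-form root; the function names are illustrative only, and the values $\sigma_v^2 = 4$, $\sigma_w^2 = 8$ are the ones used for the simulated series of Example 8.2.1.

```python
import math

def steady_state(sigma_v2, sigma_w2):
    """Closed-form steady-state Omega and smoothing coefficient a."""
    omega = 0.5 * (sigma_v2 + math.sqrt(sigma_v2 ** 2 + 4.0 * sigma_v2 * sigma_w2))
    return omega, omega / (omega + sigma_w2)

def iterate_omega(omega1, sigma_v2, sigma_w2, steps):
    """Apply Omega_{t+1} = Omega_t + sigma_v^2 - Omega_t^2 / (Omega_t + sigma_w^2)."""
    omega = omega1
    for _ in range(steps):
        omega = omega + sigma_v2 - omega ** 2 / (omega + sigma_w2)
    return omega

omega, a = steady_state(4.0, 8.0)
print(omega, a)                               # steady-state values
print(iterate_omega(0.0, 4.0, 8.0, 50))       # started below the limit
print(iterate_omega(100.0, 4.0, 8.0, 50))     # started above the limit
```

Starting the iteration from either side of $\Omega$ illustrates the statement that $\Omega_{t+1}$ always lies between $\Omega$ and $\Omega_t$.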

As shown in Example 8.2.1, the differenced process $D_t = Y_t - Y_{t-1}$ is the MA(1) process

$$D_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

where $\theta/(1 + \theta^2) = -\sigma_w^2/(2\sigma_w^2 + \sigma_v^2)$. Solving this equation for $\theta$ (Problem 8.15), we find that

$$\theta = -\frac{1}{2\sigma_w^2}\left(2\sigma_w^2 + \sigma_v^2 - \sqrt{\sigma_v^4 + 4\sigma_v^2\sigma_w^2}\right)$$

and that $\theta = a - 1$.
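As a quick numerical check of the two relations just stated, the following sketch evaluates $\theta$ from the closed form and compares it with the moment equation and with $a - 1$. The helper name is hypothetical and the parameter values are chosen only for illustration.

```python
import math

def ma1_theta(sigma_v2, sigma_w2):
    """MA(1) coefficient of the differenced random walk plus noise series."""
    root = math.sqrt(sigma_v2 ** 2 + 4.0 * sigma_v2 * sigma_w2)
    return -(2.0 * sigma_w2 + sigma_v2 - root) / (2.0 * sigma_w2)

sigma_v2, sigma_w2 = 4.0, 8.0
theta = ma1_theta(sigma_v2, sigma_w2)

# Steady-state Omega and a, as defined in the text.
omega = 0.5 * (sigma_v2 + math.sqrt(sigma_v2 ** 2 + 4.0 * sigma_v2 * sigma_w2))
a = omega / (omega + sigma_w2)

lhs = theta / (1.0 + theta ** 2)
rhs = -sigma_w2 / (2.0 * sigma_w2 + sigma_v2)
print(theta, a - 1.0, lhs, rhs)
```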
It is instructive to derive the exponential smoothing formula for $\hat Y_t$ directly from the ARIMA(0,1,1) structure of $\{Y_t\}$. For $t \ge 2$, we have from Section 6.5 that

$$\hat Y_{t+1} = Y_t + \theta_{t1}(Y_t - \hat Y_t) = -\theta_{t1}\hat Y_t + (1 + \theta_{t1})Y_t,$$

where $\theta_{t1}$ is found by application of the innovations algorithm to an MA(1) process with coefficient $\theta$. It follows that $1 - a_t = -\theta_{t1}$, and since $\theta_{t1} \to \theta$ (see Remark 1 of Section 3.3) and $a_t$ converges to the steady-state solution $a$, we conclude that

$$1 - a = \lim_{t\to\infty}(1 - a_t) = -\lim_{t\to\infty}\theta_{t1} = -\theta.$$


Kalman Filtering:

The filtered estimates $X_{t|t} = P_t(X_t)$ and their error covariance matrices $\Omega_{t|t} = E[(X_t - X_{t|t})(X_t - X_{t|t})']$ are determined by the relations

$$P_tX_t = P_{t-1}X_t + \Omega_tG_t'\Delta_t^{-1}(Y_t - G_t\hat X_t) \tag{8.4.10}$$

and

$$\Omega_{t|t} = \Omega_t - \Omega_tG_t'\Delta_t^{-1}G_t\Omega_t'. \tag{8.4.11}$$

Proof From (8.4.3) it follows that

$$P_tX_t = P_{t-1}X_t + MI_t,$$

where

$$M = E(X_tI_t')[E(I_tI_t')]^{-1} = E\left[X_t(G_t(X_t - \hat X_t) + W_t)'\right]\Delta_t^{-1} = \Omega_tG_t'\Delta_t^{-1}. \tag{8.4.12}$$

To establish (8.4.11) we write

$$X_t - P_{t-1}X_t = X_t - P_tX_t + P_tX_t - P_{t-1}X_t = X_t - P_tX_t + MI_t.$$

Using (8.4.12) and the orthogonality of $X_t - P_tX_t$ and $MI_t$, we find from the last equation that

$$\Omega_t = \Omega_{t|t} + \Omega_tG_t'\Delta_t^{-1}G_t\Omega_t',$$

as required. $\square$

Kalman Fixed-Point Smoothing:

The smoothed estimates $X_{t|n} = P_nX_t$ and the error covariance matrices $\Omega_{t|n} = E[(X_t - X_{t|n})(X_t - X_{t|n})']$ are determined for fixed $t$ by the following recursions, which can be solved successively for $n = t, t + 1, \ldots$:

$$P_nX_t = P_{n-1}X_t + \Omega_{t,n}G_n'\Delta_n^{-1}(Y_n - G_n\hat X_n), \tag{8.4.13}$$

$$\Omega_{t,n+1} = \Omega_{t,n}[F_n - \Theta_n\Delta_n^{-1}G_n]', \tag{8.4.14}$$

$$\Omega_{t|n} = \Omega_{t|n-1} - \Omega_{t,n}G_n'\Delta_n^{-1}G_n\Omega_{t,n}', \tag{8.4.15}$$

with initial conditions $P_{t-1}X_t = \hat X_t$ and $\Omega_{t,t} = \Omega_{t|t-1} = \Omega_t$ (found from Kalman prediction).

Proof Using (8.4.3) we can write $P_nX_t = P_{n-1}X_t + CI_n$, where $I_n = G_n(X_n - \hat X_n) + W_n$. By Remark 4 above,

$$C = E\left[X_t(G_n(X_n - \hat X_n) + W_n)'\right][E(I_nI_n')]^{-1} = \Omega_{t,n}G_n'\Delta_n^{-1}, \tag{8.4.16}$$

where $\Omega_{t,n} := E[(X_t - \hat X_t)(X_n - \hat X_n)']$. It follows now from (8.1.2), (8.4.5), the orthogonality of $V_n$ and $W_n$ with $X_t - \hat X_t$, and the definition of $\Omega_{t,n}$ that

$$\Omega_{t,n+1} = E\left[(X_t - \hat X_t)(X_n - \hat X_n)'(F_n - \Theta_n\Delta_n^{-1}G_n)'\right] = \Omega_{t,n}[F_n - \Theta_n\Delta_n^{-1}G_n]',$$

thus establishing (8.4.14). To establish (8.4.15) we write

$$X_t - P_nX_t = X_t - P_{n-1}X_t - CI_n.$$

Using (8.4.16) and the orthogonality of $X_t - P_nX_t$ and $I_n$, the last equation then gives

$$\Omega_{t|n} = \Omega_{t|n-1} - \Omega_{t,n}G_n'\Delta_n^{-1}G_n\Omega_{t,n}', \quad n = t, t + 1, \ldots,$$

as required. $\square$
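For a scalar state and observation, the filtering update (8.4.10)-(8.4.11) reduces to a weighted correction of the one-step predictor. The sketch below is a minimal illustration under that scalar assumption; the function name is hypothetical.

```python
def kalman_filter_step(x_pred, omega, y, g, r):
    """One scalar filtering update.

    x_pred : one-step predictor of X_t, i.e. P_{t-1} X_t
    omega  : its mean squared error Omega_t
    y      : the new observation Y_t
    g, r   : scalar G_t and R_t
    Returns (P_t X_t, Omega_{t|t}).
    """
    delta = g * omega * g + r                   # Delta_t
    gain = omega * g / delta                    # M in (8.4.12)
    x_filt = x_pred + gain * (y - g * x_pred)   # (8.4.10)
    omega_filt = omega - gain * g * omega       # (8.4.11)
    return x_filt, omega_filt
```

With equal state and noise variances the gain is one half, so the filtered estimate lies midway between the predictor and the observation and the error variance is halved.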


8.5 Estimation For State-Space Models

Consider the state-space model defined by equations (8.1.1) and (8.1.2) and suppose that the model is completely parameterized by the components of the vector $\theta$. The maximum likelihood estimate of $\theta$ is found by maximizing the likelihood of the observations $Y_1, \ldots, Y_n$ with respect to the components of the vector $\theta$. If the conditional probability density of $Y_t$ given $Y_{t-1} = y_{t-1}, \ldots, Y_0 = y_0$ is $f_t(\cdot \mid y_{t-1}, \ldots, y_0)$, then the likelihood of $Y_t$, $t = 1, \ldots, n$ (conditional on $Y_0$), can immediately be written as

$$L(\theta; Y_1, \ldots, Y_n) = \prod_{t=1}^{n} f_t(Y_t \mid Y_{t-1}, \ldots, Y_0). \tag{8.5.1}$$

The calculation of the likelihood for any fixed numerical value of $\theta$ is extremely complicated in general, but is greatly simplified if $Y_0$, $X_1$ and $W_t$, $V_t$, $t = 1, 2, \ldots$, are assumed to be jointly Gaussian. The resulting likelihood is called the Gaussian likelihood and is widely used in time series analysis (cf. Section 5.2) whether the time series is truly Gaussian or not. As before, we shall continue to use the term likelihood to mean Gaussian likelihood.

If $Y_0$, $X_1$ and $W_t$, $V_t$, $t = 1, 2, \ldots$, are jointly Gaussian, then the conditional densities in (8.5.1) are given by

$$f_t(Y_t \mid Y_{t-1}, \ldots, Y_0) = (2\pi)^{-w/2}(\det\Delta_t)^{-1/2}\exp\left[-\frac{1}{2}I_t'\Delta_t^{-1}I_t\right],$$

where $I_t = Y_t - P_{t-1}Y_t = Y_t - G\hat X_t$, and $P_{t-1}Y_t$ and $\Delta_t$, $t \ge 1$, are the one-step predictors and error covariance matrices found from the Kalman prediction recursions. The likelihood of the observations $Y_1, \ldots, Y_n$ (conditional on $Y_0$) can therefore be expressed as

$$L(\theta; Y_1, \ldots, Y_n) = (2\pi)^{-nw/2}\left(\prod_{j=1}^{n}\det\Delta_j\right)^{-1/2}\exp\left[-\frac{1}{2}\sum_{j=1}^{n}I_j'\Delta_j^{-1}I_j\right]. \tag{8.5.2}$$

Given the observations $Y_1, \ldots, Y_n$, the distribution of $Y_0$ (see Section 8.4), and a particular parameter value $\theta$, the numerical value of the likelihood $L$ can be computed from the previous equation with the aid of the Kalman recursions of Section 8.4. To find maximum likelihood estimates of the components of $\theta$, a nonlinear optimization algorithm must be used to search for the value of $\theta$ that maximizes the value of $L$.

Having estimated the parameter vector $\theta$, we can compute forecasts based on the fitted state-space model and estimated mean squared errors by direct application of equations (8.4.7) and (8.4.9).
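To make the computation of (8.5.2) concrete, here is a minimal sketch that accumulates $-2\ln L$ while running the Kalman prediction recursions. It assumes a scalar, time-invariant model ($w = 1$) with known initial predictor and mean squared error; the function name and signature are hypothetical, not part of ITSM.

```python
import math

def neg2_log_likelihood(y, f, g, q, r, x1, omega1):
    """-2 ln L from (8.5.2) for a scalar, time-invariant state-space model.

    y          : observations Y_1, ..., Y_n
    f, g, q, r : scalar F, G, Q, R
    x1, omega1 : initial predictor of X_1 and its mean squared error
    """
    x_hat, omega = x1, omega1
    total = len(y) * math.log(2.0 * math.pi)
    for yt in y:
        delta = g * omega * g + r          # Delta_t
        theta = f * omega * g              # Theta_t
        innov = yt - g * x_hat             # innovation I_t
        total += math.log(delta) + innov ** 2 / delta
        x_hat = f * x_hat + theta / delta * innov        # (8.4.1)
        omega = f * omega * f + q - theta ** 2 / delta   # (8.4.2)
    return total
```

In practice this function would be handed to a numerical optimizer that searches over the parameters entering `f`, `g`, `q`, and `r`; which optimizer to use is left open here.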


Application to Structural Models 


The general structural model for a univariate time series $\{Y_t\}$ of which we gave examples in Section 8.2 has the form

$$Y_t = GX_t + W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2), \tag{8.5.3}$$

$$X_{t+1} = FX_t + V_t, \quad \{V_t\} \sim \mathrm{WN}(0, Q), \tag{8.5.4}$$

for $t = 1, 2, \ldots$, where $F$ and $G$ are assumed known. We set $Y_0 = 1$ in order to include constant terms in our predictors and complete the specification of the model by prescribing the mean and covariance matrix of the initial state $X_1$. A simple and convenient assumption is that $X_1$ is equal to a deterministic but unknown parameter $\mu$ and that $\hat X_1 = \mu$, so that $\Omega_1 = 0$. The parameters of the model are then $\mu$, $Q$, and $\sigma_w^2$.

Direct maximization of the likelihood (8.5.2) is difficult if the dimension of the state vector is large. The maximization can, however, be simplified by the following stepwise procedure. For fixed $Q$ we find $\hat\mu(Q)$ and $\hat\sigma_w^2(Q)$ that maximize the likelihood $L(\mu, Q, \sigma_w^2)$. We then maximize the “reduced likelihood” $L(\hat\mu(Q), Q, \hat\sigma_w^2(Q))$ with respect to $Q$.

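To achieve this we define the mean-corrected state vectors, $X_t^* = X_t - F^{t-1}\mu$, and apply the Kalman prediction recursions to $\{X_t^*\}$ with initial condition $\hat X_1^* = 0$. This gives, from (8.4.1),

$$\hat X_{t+1}^* = F\hat X_t^* + \Theta_t\Delta_t^{-1}(Y_t - G\hat X_t^*), \quad t = 1, 2, \ldots, \tag{8.5.5}$$

with $\hat X_1^* = 0$. Since $\hat X_t$ also satisfies (8.5.5), but with initial condition $\hat X_1 = \mu$, it follows that

$$\hat X_t = \hat X_t^* + C_t\mu \tag{8.5.6}$$

for some $v \times v$ matrices $C_t$. (Note that although $\hat X_t = P(X_t \mid Y_0, Y_1, \ldots, Y_{t-1})$, the quantity $\hat X_t^*$ is not the corresponding predictor of $X_t^*$.) The matrices $C_t$ can be determined recursively from (8.5.5), (8.5.6), and (8.4.1). Substituting (8.5.6) into (8.5.5) and using (8.4.1), we have

$$\hat X_{t+1}^* = F(\hat X_t - C_t\mu) + \Theta_t\Delta_t^{-1}\left(Y_t - G(\hat X_t - C_t\mu)\right)$$

$$= F\hat X_t + \Theta_t\Delta_t^{-1}(Y_t - G\hat X_t) - (F - \Theta_t\Delta_t^{-1}G)C_t\mu$$

$$= \hat X_{t+1} - (F - \Theta_t\Delta_t^{-1}G)C_t\mu,$$

so that

$$C_{t+1} = (F - \Theta_t\Delta_t^{-1}G)C_t \tag{8.5.7}$$

with $C_1$ equal to the identity matrix. The quadratic form in the likelihood (8.5.2) is therefore

$$S(\mu, Q, \sigma_w^2) = \sum_{t=1}^{n}\frac{(Y_t - G\hat X_t)^2}{\Delta_t} \tag{8.5.8}$$

$$= \sum_{t=1}^{n}\frac{(Y_t - G\hat X_t^* - GC_t\mu)^2}{\Delta_t}. \tag{8.5.9}$$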


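Now let $Q^* := \sigma_w^{-2}Q$ and define $L^*$ to be the likelihood function with this new parameterization, i.e., $L^*(\mu, Q^*, \sigma_w^2) = L(\mu, \sigma_w^2Q^*, \sigma_w^2)$. Writing $\Delta_t^* = \sigma_w^{-2}\Delta_t$ and $\Omega_t^* = \sigma_w^{-2}\Omega_t$, we see that the predictors $\hat X_t^*$ and the matrices $C_t$ in (8.5.7) depend on the parameters only through $Q^*$. Thus,

$$S(\mu, Q, \sigma_w^2) = \sigma_w^{-2}S(\mu, Q^*, 1),$$

so that

$$-2\ln L^*(\mu, Q^*, \sigma_w^2) = n\ln(2\pi) + \sum_{t=1}^{n}\ln\Delta_t + \sigma_w^{-2}S(\mu, Q^*, 1)$$

$$= n\ln(2\pi) + \sum_{t=1}^{n}\ln\Delta_t^* + n\ln\sigma_w^2 + \sigma_w^{-2}S(\mu, Q^*, 1).$$

For $Q^*$ fixed, it is easy to show (see Problem 8.18) that this function is minimized when

$$\hat\mu = \hat\mu(Q^*) = \left[\sum_{t=1}^{n}\frac{C_t'G'GC_t}{\Delta_t^*}\right]^{-1}\sum_{t=1}^{n}\frac{C_t'G'(Y_t - G\hat X_t^*)}{\Delta_t^*} \tag{8.5.10}$$

and

$$\hat\sigma_w^2 = \hat\sigma_w^2(Q^*) = n^{-1}\sum_{t=1}^{n}\frac{(Y_t - G\hat X_t^* - GC_t\hat\mu)^2}{\Delta_t^*}. \tag{8.5.11}$$

Replacing $\mu$ and $\sigma_w^2$ by these values in $-2\ln L^*$ and ignoring constants, the reduced likelihood becomes

$$\ell(Q^*) = \ln\left(n^{-1}\sum_{t=1}^{n}\frac{(Y_t - G\hat X_t^* - GC_t\hat\mu)^2}{\Delta_t^*}\right) + n^{-1}\sum_{t=1}^{n}\ln(\det\Delta_t^*). \tag{8.5.12}$$

If $\hat Q^*$ denotes the minimizer of (8.5.12), then the maximum likelihood estimators of the parameters $\mu$, $Q$, $\sigma_w^2$ are $\hat\mu$, $\hat\sigma_w^2\hat Q^*$, $\hat\sigma_w^2$, where $\hat\mu$ and $\hat\sigma_w^2$ are computed from (8.5.10) and (8.5.11) with $Q^*$ replaced by $\hat Q^*$.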

We can now summarize the steps required for computing the maximum likelihood estimators of $\mu$, $Q$, and $\sigma_w^2$ for the model (8.5.3)-(8.5.4).

1. For a fixed $Q^*$, apply the Kalman prediction recursions with $\hat X_1^* = 0$, $\Omega_1 = 0$, $Q = Q^*$, and $\sigma_w^2 = 1$ to obtain the predictors $\hat X_t^*$. Let $\Delta_t^*$ denote the one-step prediction mean squared error produced by these recursions.

2. Set $\hat\mu = \hat\mu(Q^*) = \left[\sum_{t=1}^{n} C_t'G'GC_t/\Delta_t^*\right]^{-1}\sum_{t=1}^{n} C_t'G'(Y_t - G\hat X_t^*)/\Delta_t^*$.

3. Let $\hat Q^*$ be the minimizer of (8.5.12).

4. The maximum likelihood estimators of $\mu$, $Q$, and $\sigma_w^2$ are then given by $\hat\mu$, $\hat\sigma_w^2\hat Q^*$, and $\hat\sigma_w^2$, respectively, where $\hat\mu$ and $\hat\sigma_w^2$ are found from (8.5.10) and (8.5.11) evaluated at $\hat Q^*$.
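The steps above can be sketched for the simplest structural model, the random walk plus noise model with $F = G = 1$, where every quantity is a scalar. The code is an illustration under that assumption only; the function name is hypothetical, and the outer minimization over $Q^*$ (step 3) is left to whatever one-dimensional search the reader prefers.

```python
import math

def reduced_likelihood(y, q_star):
    """Steps 1-2 and (8.5.11)-(8.5.12) for the scalar model F = G = 1.

    y      : observations Y_1, ..., Y_n
    q_star : the fixed ratio Q* = sigma_v^2 / sigma_w^2
    Returns (mu_hat, sigma_w2_hat, ell), where ell is the value of (8.5.12).
    """
    x_star, omega, c = 0.0, 0.0, 1.0      # X*_1 = 0, Omega*_1 = 0, C_1 = 1
    rows = []                             # (Y_t - X*_t, C_t, Delta*_t) for each t
    for yt in y:
        delta = omega + 1.0               # Delta*_t, with sigma_w^2 set to 1
        gain = omega / delta              # Theta_t / Delta_t when F = G = 1
        rows.append((yt - x_star, c, delta))
        x_star += gain * (yt - x_star)    # (8.5.5)
        c *= 1.0 - gain                   # (8.5.7)
        omega = omega + q_star - omega ** 2 / delta   # (8.4.2)
    n = len(y)
    mu = (sum(c_t * e / d for e, c_t, d in rows)
          / sum(c_t * c_t / d for _, c_t, d in rows))                     # (8.5.10)
    sigma_w2 = sum((e - c_t * mu) ** 2 / d for e, c_t, d in rows) / n     # (8.5.11)
    ell = math.log(sigma_w2) + sum(math.log(d) for _, _, d in rows) / n   # (8.5.12)
    return mu, sigma_w2, ell
```

Evaluating `reduced_likelihood` on a grid of `q_star` values and keeping the smallest `ell` is one simple way to carry out step 3; the estimate of $\sigma_v^2$ is then the product of the selected `q_star` and the returned `sigma_w2`. When `q_star` is zero the model collapses to a constant level plus noise, and the estimates reduce to the sample mean and the (biased) sample variance.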


Example 8.5.1 Random walk plus noise model

In Example 8.2.1, 100 observations were generated from the structural model

$$Y_t = M_t + W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2),$$

$$M_{t+1} = M_t + V_t, \quad \{V_t\} \sim \mathrm{WN}(0, \sigma_v^2),$$

with initial values $\mu = M_1 = 0$, $\sigma_w^2 = 8$, and $\sigma_v^2 = 4$. The maximum likelihood estimates of the parameters are found by first minimizing (8.5.12) with $\hat\mu$ given by (8.5.10). Substituting these values into (8.5.11) gives $\hat\sigma_w^2$. The resulting estimates are $\hat\mu = .906$, $\hat\sigma_v^2 = 5.351$, and $\hat\sigma_w^2 = 8.233$, which are in reasonably close agreement with the true values.

Example 8.5.2 International airline passengers, 1949-1960; AIRPASS.TSM 


The monthly totals of international airline passengers from January 1949 to December 
1960 (Box and Jenkins, 1976) are displayed in Figure 8.3. The data exhibit both a 
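strong seasonal pattern and a nearly linear trend. Since the variability of the data $Y_1, \ldots, Y_{144}$ increases for larger values of $Y_t$, it may be appropriate to consider a logarithmic transformation of the data. For the purpose of this illustration, however, we will fit a structural model incorporating a randomly varying trend and seasonal and noise components (see Example 8.2.3) to the raw data. This model has the form

$$Y_t = GX_t + W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2),$$

$$X_{t+1} = FX_t + V_t, \quad \{V_t\} \sim \mathrm{WN}(0, Q),$$

where $X_t$ is a 13-dimensional state-vector,

$$F = \begin{bmatrix}
1 & 1 & 0 & 0 & \cdots & 0 & 0\\
0 & 1 & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & -1 & -1 & \cdots & -1 & -1\\
0 & 0 & 1 & 0 & \cdots & 0 & 0\\
0 & 0 & 0 & 1 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & 0 & \cdots & 1 & 0
\end{bmatrix},$$

$$G = \begin{bmatrix} 1 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix},$$

and

$$Q = \begin{bmatrix}
\sigma_1^2 & 0 & 0 & 0 & \cdots & 0\\
0 & \sigma_2^2 & 0 & 0 & \cdots & 0\\
0 & 0 & \sigma_3^2 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & 0 & \cdots & 0
\end{bmatrix}.$$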


The parameters of the model are $\mu$, $\sigma_1^2$, $\sigma_2^2$, $\sigma_3^2$, and $\sigma_w^2$, where $\mu = X_1$. Minimizing (8.5.12) with respect to $Q^*$ we find from (8.5.11) and (8.5.12) that

$$(\hat\sigma_1^2, \hat\sigma_2^2, \hat\sigma_3^2, \hat\sigma_w^2) = (170.63, .00000, 11.338, .014179)$$

and from (8.5.10) that $\hat\mu = (146.9, 2.171, 34.92, 34.12, -47.00, 16.98, 22.99, 53.99, 58.34, 33.65, 2.204, -4.053, -6.894)'$. The first component, $X_{t1}$, of the state vector corresponds to the local linear trend with slope $X_{t2}$. Since $\hat\sigma_2^2 = 0$, the slope at time $t$, which satisfies

$$X_{t,2} = X_{t-1,2} + V_{t-1,2},$$

must be nearly constant and equal to $\hat X_{1,2} = 2.171$. The first three components of the predictors $\hat X_t$ are plotted in Figure 8.4. Notice that the first component varies like a random walk around a straight line, while the second component is nearly constant as a result of $\hat\sigma_2^2 \approx 0$. The third component, corresponding to the seasonal component, exhibits a clear seasonal cycle that repeats roughly the same pattern throughout the 12 years of data. The one-step predictors $\hat X_{t1} + \hat X_{t3}$ of $Y_t$ are plotted in Figure 8.5 (solid line) together with the actual data (square boxes). For this model the predictors follow the movement of the data quite well.

Figure 8-3 International airline passengers; monthly totals from January 1949 to December 1960.

Figure 8-4 The one-step predictors $(\hat X_{t1}, \hat X_{t2}, \hat X_{t3})'$ for the airline passenger data in Example 8.5.2.

Figure 8-5 The one-step predictors $\hat Y_t$ for the airline passenger data (solid line) and the actual data (square boxes).


8.6 State-Space Models with Missing Observations 


State-space representations and the associated Kalman recursions are ideally suited to the analysis of data with missing values, as was pointed out by Jones (1980) in the context of maximum likelihood estimation for ARMA processes. In this section we shall deal with two missing-value problems for state-space models. The first is the evaluation of the (Gaussian) likelihood based on $\{Y_{i_1}, \ldots, Y_{i_r}\}$, where $i_1, i_2, \ldots, i_r$ are positive integers such that $1 \le i_1 < i_2 < \cdots < i_r \le n$. (This allows for observation of the process $\{Y_t\}$ at irregular intervals, or equivalently for the possibility that $(n - r)$ observations are missing from the sequence $\{Y_1, \ldots, Y_n\}$.) The solution of this problem will, in particular, enable us to carry out maximum likelihood estimation for ARMA and ARIMA processes with missing values. The second problem to be considered is the minimum mean squared error estimation of the missing values themselves.

The Gaussian Likelihood of $\{Y_{i_1}, \ldots, Y_{i_r}\}$, $1 \le i_1 < i_2 < \cdots < i_r \le n$

Consider the state-space model defined by equations (8.1.1) and (8.1.2) and suppose that the model is completely parameterized by the components of the vector $\theta$. If there are no missing observations, i.e., if $r = n$ and $i_j = j$, $j = 1, \ldots, n$, then the likelihood of the observations $\{Y_1, \ldots, Y_n\}$ is easily found as in Section 8.5 to be

$$L(\theta; Y_1, \ldots, Y_n) = (2\pi)^{-nw/2}\left(\prod_{j=1}^{n}\det\Delta_j\right)^{-1/2}\exp\left[-\frac{1}{2}\sum_{j=1}^{n}I_j'\Delta_j^{-1}I_j\right],$$

where $I_j = Y_j - P_{j-1}Y_j$, and $P_{j-1}Y_j$ and $\Delta_j$, $j \ge 1$, are the one-step predictors and error covariance matrices found from (8.4.7) and (8.4.9) with $Y_0 = 1$.

To deal with the more general case of possibly irregularly spaced observations $\{Y_{i_1}, \ldots, Y_{i_r}\}$, we introduce a new series $\{Y_t^*\}$, related to the process $\{X_t\}$ by the modified observation equation

$$Y_t^* = G_t^*X_t + W_t^*, \quad t = 1, 2, \ldots, \tag{8.6.1}$$

where

$$G_t^* = \begin{cases} G_t & \text{if } t \in \{i_1, \ldots, i_r\},\\ 0 & \text{otherwise,}\end{cases} \qquad W_t^* = \begin{cases} W_t & \text{if } t \in \{i_1, \ldots, i_r\},\\ N_t & \text{otherwise,}\end{cases} \tag{8.6.2}$$

and $\{N_t\}$ is iid with

$$N_t \sim \mathrm{N}(0, I_{w \times w}), \quad N_s \perp X_1, \quad N_s \perp \begin{bmatrix} V_t \\ W_t \end{bmatrix}, \quad s, t = 0, \pm 1, \ldots. \tag{8.6.3}$$


Equations (8.6.1) and (8.1.2) constitute a state-space representation for the new series $\{Y_t^*\}$, which coincides with $\{Y_t\}$ at each $t \in \{i_1, i_2, \ldots, i_r\}$, and at other times takes random values that are independent of $\{Y_t\}$ with a distribution independent of $\theta$.

Let $L_1(\theta; y_{i_1}, \ldots, y_{i_r})$ be the Gaussian likelihood based on the observed values $y_{i_1}, \ldots, y_{i_r}$ of $Y_{i_1}, \ldots, Y_{i_r}$ under the model defined by (8.1.1) and (8.1.2). Corresponding to these observed values, we define a new sequence, $y_1^*, \ldots, y_n^*$, by

$$y_t^* = \begin{cases} y_t & \text{if } t \in \{i_1, \ldots, i_r\},\\ 0 & \text{otherwise.}\end{cases} \tag{8.6.4}$$

Then it is clear from the preceding paragraph that

$$L_1(\theta; y_{i_1}, \ldots, y_{i_r}) = (2\pi)^{(n-r)w/2}L_2(\theta; y_1^*, \ldots, y_n^*), \tag{8.6.5}$$

where $L_2$ denotes the Gaussian likelihood under the model defined by (8.6.1) and (8.1.2).

In view of (8.6.5) we can now compute the required likelihood $L_1$ of the realized values $\{y_t, t = i_1, \ldots, i_r\}$ as follows:

i. Define the sequence $\{y_t^*, t = 1, \ldots, n\}$ as in (8.6.4).

ii. Find the one-step predictors $\hat Y_t^*$ of $Y_t^*$, and their error covariance matrices $\Delta_t^*$, using Kalman prediction and equations (8.4.7) and (8.4.9) applied to the state-space representation (8.6.1) and (8.1.2) of $\{Y_t^*\}$. Denote the realized values of the predictors, based on the observation sequence $\{y_t^*\}$, by $\{\hat y_t^*\}$.

iii. The required Gaussian likelihood of the irregularly spaced observations $\{y_{i_1}, \ldots, y_{i_r}\}$ is then, by (8.6.5),

$$L_1(\theta; y_{i_1}, \ldots, y_{i_r}) = (2\pi)^{-rw/2}\left(\prod_{j=1}^{n}\det\Delta_j^*\right)^{-1/2}\exp\left\{-\frac{1}{2}\sum_{j=1}^{n}i_j^{*\prime}\Delta_j^{*-1}i_j^*\right\},$$

where $i_j^*$ denotes the observed innovation $y_j^* - \hat y_j^*$, $j = 1, \ldots, n$.

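Example 8.6.1 An AR(1) series with one missing observation

Let $\{Y_t\}$ be the causal AR(1) process defined by

$$Y_t - \phi Y_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

To find the Gaussian likelihood of the observations $y_1$, $y_3$, $y_4$, and $y_5$ of $Y_1$, $Y_3$, $Y_4$, and $Y_5$ we follow the steps outlined above.

i. Set $y_i^* = y_i$, $i = 1, 3, 4, 5$ and $y_2^* = 0$.

ii. We start with the state-space model for $\{Y_t\}$ from Example 8.1.1, i.e., $Y_t = X_t$, $X_{t+1} = \phi X_t + Z_{t+1}$. The corresponding model for $\{Y_t^*\}$ is then, from (8.6.1),

$$X_{t+1} = F_tX_t + V_t, \quad t = 1, 2, \ldots,$$

$$Y_t^* = G_t^*X_t + W_t^*, \quad t = 1, 2, \ldots,$$

where

$$F_t = \phi, \quad G_t^* = \begin{cases} 1 & \text{if } t \ne 2,\\ 0 & \text{if } t = 2,\end{cases} \quad V_t = Z_{t+1}, \quad W_t^* = \begin{cases} 0 & \text{if } t \ne 2,\\ N_t & \text{if } t = 2,\end{cases}$$

$$Q_t = \sigma^2, \quad R_t^* = \begin{cases} 0 & \text{if } t \ne 2,\\ 1 & \text{if } t = 2,\end{cases} \quad S_t^* = 0,$$

and $X_1 = \sum_{j=0}^{\infty}\phi^jZ_{1-j}$. Starting from the initial conditions

$$\hat X_1 = 0, \quad \Omega_1 = \sigma^2/(1 - \phi^2),$$

and applying the recursions (8.4.1) and (8.4.2), we find (Problem 8.19) that

$$\Theta_t\Delta_t^{-1} = \begin{cases} \phi & \text{if } t = 1, 3, 4, 5,\\ 0 & \text{if } t = 2,\end{cases} \qquad \Omega_t = \begin{cases} \sigma^2/(1 - \phi^2) & \text{if } t = 1,\\ \sigma^2(1 + \phi^2) & \text{if } t = 3,\\ \sigma^2 & \text{if } t = 2, 4, 5,\end{cases}$$

and

$$\hat X_1 = 0, \quad \hat X_2 = \phi Y_1, \quad \hat X_3 = \phi^2Y_1, \quad \hat X_4 = \phi Y_3, \quad \hat X_5 = \phi Y_4.$$

From (8.4.7) and (8.4.9) with $h = 1$, we find that

$$\hat Y_1^* = 0, \quad \hat Y_2^* = 0, \quad \hat Y_3^* = \phi^2Y_1, \quad \hat Y_4^* = \phi Y_3, \quad \hat Y_5^* = \phi Y_4,$$

with corresponding mean squared errors

$$\Delta_1^* = \sigma^2/(1 - \phi^2), \quad \Delta_2^* = 1, \quad \Delta_3^* = \sigma^2(1 + \phi^2), \quad \Delta_4^* = \sigma^2, \quad \Delta_5^* = \sigma^2.$$

iii. From the preceding calculations we can now write the likelihood of the original data as

$$L_1(\phi, \sigma^2; y_1, y_3, y_4, y_5) = \sigma^{-4}(2\pi)^{-2}\left[(1 - \phi^2)/(1 + \phi^2)\right]^{1/2} \exp\left\{-\frac{1}{2\sigma^2}\left[y_1^2(1 - \phi^2) + \frac{(y_3 - \phi^2y_1)^2}{1 + \phi^2} + (y_4 - \phi y_3)^2 + (y_5 - \phi y_4)^2\right]\right\}.$$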
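Steps i-iii can be checked numerically for this example. The sketch below runs the scalar Kalman prediction recursions on the modified series (with `None` marking a missing value) and compares the result with the closed-form likelihood just derived. The function names and the numerical values of $\phi$, $\sigma^2$, and the data are illustrative assumptions, not taken from the text.

```python
import math

def ar1_missing_likelihood(y, phi, sigma2):
    """Gaussian likelihood of a causal AR(1) series; None marks a missing value."""
    n_obs = sum(v is not None for v in y)
    x_hat = 0.0                              # predictor of X_1
    omega = sigma2 / (1.0 - phi ** 2)        # Omega_1
    log_det, quad = 0.0, 0.0
    for v in y:
        # Modified model (8.6.1)-(8.6.4): G* = 0, R* = 1, y* = 0 at a missing time.
        g, r, y_star = (0.0, 1.0, 0.0) if v is None else (1.0, 0.0, v)
        delta = g * omega * g + r
        theta = phi * omega * g
        innov = y_star - g * x_hat
        log_det += math.log(delta)
        quad += innov ** 2 / delta
        x_hat = phi * x_hat + theta / delta * innov          # (8.4.1)
        omega = phi * omega * phi + sigma2 - theta ** 2 / delta   # (8.4.2)
    return (2.0 * math.pi) ** (-n_obs / 2.0) * math.exp(-0.5 * (log_det + quad))

def closed_form(y1, y3, y4, y5, phi, sigma2):
    """The likelihood of step iii, written out directly."""
    s = (y1 ** 2 * (1.0 - phi ** 2) + (y3 - phi ** 2 * y1) ** 2 / (1.0 + phi ** 2)
         + (y4 - phi * y3) ** 2 + (y5 - phi * y4) ** 2)
    return (sigma2 ** -2 * (2.0 * math.pi) ** -2
            * math.sqrt((1.0 - phi ** 2) / (1.0 + phi ** 2))
            * math.exp(-s / (2.0 * sigma2)))

recursive = ar1_missing_likelihood([1.0, None, 2.0, 0.3, -0.4], 0.5, 1.0)
direct = closed_form(1.0, 2.0, 0.3, -0.4, 0.5, 1.0)
print(recursive, direct)
```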


Remark 1. If we are given observations $y_{1-d}, y_{2-d}, \ldots, y_0, y_{i_1}, y_{i_2}, \ldots, y_{i_r}$ of an ARIMA($p$, $d$, $q$) process at times $1 - d, 2 - d, \ldots, 0, i_1, \ldots, i_r$, where $1 \le i_1 < i_2 < \cdots < i_r \le n$, a similar argument can be used to find the Gaussian likelihood of $y_{i_1}, \ldots, y_{i_r}$ conditional on $Y_{1-d} = y_{1-d}, Y_{2-d} = y_{2-d}, \ldots, Y_0 = y_0$. Missing values among the first $d$ observations $y_{1-d}, y_{2-d}, \ldots, y_0$ can be handled by treating them as unknown parameters for likelihood maximization. For more on ARIMA series with missing values see TSTM and Ansley and Kohn (1985).


Estimation of Missing Values for State-Space Models 


Given that we observe only $Y_{i_1}, Y_{i_2}, \ldots, Y_{i_r}$, $1 \le i_1 < i_2 < \cdots < i_r \le n$, where $\{Y_t\}$ has the state-space representation (8.1.1) and (8.1.2), we now consider the problem of finding the minimum mean squared error estimators $P(Y_t \mid Y_0, Y_{i_1}, \ldots, Y_{i_r})$ of $Y_t$, $1 \le t \le n$, where $Y_0 = 1$. To handle this problem we again use the modified process $\{Y_t^*\}$ defined by (8.6.1) and (8.1.2) with $Y_0^* = 1$. Since $Y_s^* = Y_s$ for $s \in \{i_1, \ldots, i_r\}$ and $Y_s^* \perp X_t, Y_0$ for $1 \le t \le n$ and $s \notin \{0, i_1, \ldots, i_r\}$, we immediately obtain the minimum mean squared error state estimators

$$P(X_t \mid Y_0, Y_{i_1}, \ldots, Y_{i_r}) = P(X_t \mid Y_0^*, Y_1^*, \ldots, Y_n^*), \quad 1 \le t \le n. \tag{8.6.6}$$

The right-hand side can be evaluated by application of the Kalman fixed-point smoothing algorithm to the state-space model (8.6.1) and (8.1.2). For computational purposes the observed values of $Y_t^*$, $t \notin \{0, i_1, \ldots, i_r\}$, are quite immaterial. They may, for example, all be set equal to zero, giving the sequence of observations of $Y_t^*$ defined in (8.6.4).

To evaluate $P(Y_t \mid Y_0, Y_{i_1}, \ldots, Y_{i_r})$, $1 \le t \le n$, we use (8.6.6) and the relation

$$Y_t = G_tX_t + W_t. \tag{8.6.7}$$

Since $E(V_tW_t') = S_t = 0$, $t = 1, \ldots, n$, we find from (8.6.7) that

$$P(Y_t \mid Y_0, Y_{i_1}, \ldots, Y_{i_r}) = G_tP(X_t \mid Y_0^*, Y_1^*, \ldots, Y_n^*). \tag{8.6.8}$$
Example 8.6.2 An AR(1) series with one missing observation

Consider the problem of estimating the missing value $Y_2$ in Example 8.6.1 in terms of $Y_0 = 1$, $Y_1$, $Y_3$, $Y_4$, and $Y_5$. We start from the state-space model $X_{t+1} = \phi X_t + Z_{t+1}$, $Y_t = X_t$, for $\{Y_t\}$. The corresponding model for $\{Y_t^*\}$ is the one used in Example 8.6.1. Applying the Kalman smoothing equations to the latter model, we find that

$$P_1X_2 = \phi Y_1, \quad P_2X_2 = \phi Y_1, \quad P_3X_2 = \frac{\phi(Y_1 + Y_3)}{1 + \phi^2}, \quad P_4X_2 = P_3X_2, \quad P_5X_2 = P_3X_2,$$

$$\Omega_{2,2} = \sigma^2, \quad \Omega_{2,3} = \phi\sigma^2, \quad \Omega_{2,t} = 0, \quad t \ge 4,$$

and

$$\Omega_{2|1} = \sigma^2, \quad \Omega_{2|2} = \sigma^2, \quad \Omega_{2|t} = \frac{\sigma^2}{1 + \phi^2}, \quad t \ge 3,$$

where $P_t(\cdot)$ here denotes $P(\cdot \mid Y_0^*, \ldots, Y_t^*)$ and $\Omega_{t,n}$, $\Omega_{t|n}$ are defined correspondingly. We deduce from (8.6.8) that the minimum mean squared error estimator of the missing value $Y_2$ is

$$P_5Y_2 = P_5X_2 = \frac{\phi(Y_1 + Y_3)}{1 + \phi^2},$$

with mean squared error

$$\Omega_{2|5} = \frac{\sigma^2}{1 + \phi^2}.$$
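The smoothing calculation of this example can be reproduced with a short scalar implementation of the fixed-point smoother (8.4.13)-(8.4.15) applied to the modified model. This is a sketch under the assumptions of the example (causal AR(1), scalar state); the function name and the numerical inputs are hypothetical.

```python
def ar1_fixed_point_smoother(y, phi, sigma2, t):
    """Smoothed estimate of X_t (1-based t) and its MSE for a causal AR(1) series.

    y : observations Y_1, ..., Y_n, with None marking a missing value
    """
    x_hat = 0.0                           # predictor of X_1
    omega = sigma2 / (1.0 - phi ** 2)     # Omega_1
    x_smooth = omega_cross = omega_smooth = None
    for s, value in enumerate(y, start=1):
        # Modified model (8.6.1): G* = 0, R* = 1, y* = 0 at a missing time.
        g, r, y_star = (0.0, 1.0, 0.0) if value is None else (1.0, 0.0, value)
        delta = g * omega * g + r
        theta = phi * omega * g
        innov = y_star - g * x_hat
        if s == t:                        # initial conditions of the smoother
            x_smooth, omega_cross, omega_smooth = x_hat, omega, omega
        if s >= t:
            x_smooth += omega_cross * g / delta * innov                # (8.4.13)
            omega_smooth -= omega_cross * g / delta * g * omega_cross  # (8.4.15)
            omega_cross *= phi - theta / delta * g                     # (8.4.14)
        x_hat = phi * x_hat + theta / delta * innov                    # (8.4.1)
        omega = phi * omega * phi + sigma2 - theta ** 2 / delta        # (8.4.2)
    return x_smooth, omega_smooth

phi, sigma2 = 0.5, 1.0
estimate, mse = ar1_fixed_point_smoother([1.0, None, 2.0, 0.3, -0.4], phi, sigma2, t=2)
print(estimate, phi * (1.0 + 2.0) / (1.0 + phi ** 2))   # smoother vs closed form
print(mse, sigma2 / (1.0 + phi ** 2))
```

Because $\Omega_{2,t} = 0$ for $t \ge 4$, the observations $Y_4$ and $Y_5$ leave the estimate unchanged, which the loop reproduces without any special handling.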
Remark 2. Suppose we have observations $Y_{1-d}, Y_{2-d}, \ldots, Y_0, Y_{i_1}, \ldots, Y_{i_r}$ $(1 \le i_1 < i_2 < \cdots < i_r \le n)$ of an ARIMA($p$, $d$, $q$) process. Determination of the best linear estimates of the missing values $Y_t$, $t \notin \{i_1, \ldots, i_r\}$, in terms of $Y_t$, $t \in \{i_1, \ldots, i_r\}$, and the components of $\mathbf{Y}_0 := (Y_{1-d}, Y_{2-d}, \ldots, Y_0)'$ can be carried out as in Example 8.6.2 using the state-space representation of the ARIMA series $\{Y_t\}$ from Example 8.3.3 and the Kalman recursions for the corresponding state-space model for $\{Y_t^*\}$ defined by (8.6.1) and (8.1.2). See TSTM for further details.

We close this section with a brief discussion of a direct approach to estimating missing observations. This approach is often more efficient than the methods just described, especially if the number of missing observations is small and we have a simple (e.g., autoregressive) model. Consider the general problem of computing $E(X \mid Y)$ when the random vector $(X', Y')'$ has a multivariate normal distribution with mean 0 and covariance matrix $\Sigma$. (In the missing observation problem, think of $X$ as the vector of the missing observations and $Y$ as the vector of observed values.) Then the joint probability density function of $X$ and $Y$ can be written as

$$f_{X,Y}(x, y) = f_{X|Y}(x \mid y)f_Y(y), \tag{8.6.9}$$

where $f_{X|Y}(x \mid y)$ is a multivariate normal density with mean $E(X \mid Y)$ and covariance matrix $\Sigma_{X|Y}$ (see Proposition A.3.1). In particular,

$$f_{X|Y}(x \mid y) = \frac{1}{\sqrt{(2\pi)^q\det\Sigma_{X|Y}}}\exp\left\{-\frac{1}{2}(x - E(X \mid y))'\Sigma_{X|Y}^{-1}(x - E(X \mid y))\right\}, \tag{8.6.10}$$

where $q = \dim(X)$. It is clear from (8.6.10) that $f_{X|Y}(x \mid y)$ (and also $f_{X,Y}(x, y)$) is maximum when $x = E(X \mid y)$. Thus, the best estimator of $X$ in terms of $Y$ can be found by maximizing the joint density of $X$ and $Y$ with respect to $x$. For autoregressive processes it is relatively straightforward to carry out this optimization, as shown in the following example.

Example 8.6.3 Estimating missing observations in an AR process


Suppose $\{Y_t\}$ is the AR($p$) process defined by

$$Y_t = \phi_1Y_{t-1} + \cdots + \phi_pY_{t-p} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

and $Y = (Y_{i_1}, \ldots, Y_{i_r})'$, with $1 \le i_1 < \cdots < i_r \le n$, are the observed values. If there are no missing observations in the first $p$ observations, then the best estimates of the missing values are found by minimizing

$$\sum_{t=p+1}^{n}(Y_t - \phi_1Y_{t-1} - \cdots - \phi_pY_{t-p})^2 \tag{8.6.11}$$

with respect to the missing values (see Problem 8.20). For the AR(1) model in Example 8.6.2, minimization of (8.6.11) is equivalent to minimizing

$$(Y_2 - \phi Y_1)^2 + (Y_3 - \phi Y_2)^2$$

with respect to $Y_2$. Setting the derivative of this expression with respect to $Y_2$ equal to 0 gives $2(Y_2 - \phi Y_1) - 2\phi(Y_3 - \phi Y_2) = 0$, and solving for $Y_2$ we obtain $E(Y_2 \mid Y_1, Y_3, Y_4, Y_5) = \phi(Y_1 + Y_3)/(1 + \phi^2)$.

8.7 The EM Algorithm


The expectation-maximization (EM) algorithm is an iterative procedure for comput- 
ing the maximum likelihood estimator when only a subset of the complete data set is 
available. Dempster, Laird, and Rubin (1977) demonstrated the wide applicability of 
the EM algorithm and are largely responsible for popularizing this method in statis- 
tics. Details regarding the convergence and performance of the EM algorithm can be 
found in Wu (1983). 

In the usual formulation of the EM algorithm, the “complete” data vector W is 
made up of “observed” data Y (sometimes called incomplete data) and “unobserved” 
data X. In many applications, X consists of values of a “latent” or unobserved process 
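occurring in the specification of the model. For example, in the state-space model of Section 8.1, $\mathbf{Y}$ could consist of the observed vectors $Y_1, \ldots, Y_n$ and $\mathbf{X}$ of the unobserved state vectors $X_1, \ldots, X_n$. The EM algorithm provides an iterative procedure for computing the maximum likelihood estimator based only on the observed data $\mathbf{Y}$. Each iteration of the EM algorithm consists of two steps. If $\theta^{(i)}$ denotes the estimated value of the parameter $\theta$ after $i$ iterations, then the two steps in the $(i + 1)$th iteration are

E-step. Calculate $Q(\theta \mid \theta^{(i)}) = E_{\theta^{(i)}}[\ell(\theta; \mathbf{X}, \mathbf{Y}) \mid \mathbf{Y}]$

and

M-step. Maximize $Q(\theta \mid \theta^{(i)})$ with respect to $\theta$.

Then $\theta^{(i+1)}$ is set equal to the maximizer of $Q$ in the M-step. In the E-step, $\ell(\theta; \mathbf{x}, \mathbf{y}) = \ln f(\mathbf{x}, \mathbf{y}; \theta)$, and $E_{\theta^{(i)}}(\cdot \mid \mathbf{Y})$ denotes the conditional expectation relative to the conditional density $f(\mathbf{x} \mid \mathbf{y}; \theta^{(i)}) = f(\mathbf{x}, \mathbf{y}; \theta^{(i)})/f(\mathbf{y}; \theta^{(i)})$.

It can be shown that $\ell(\theta^{(i)}; \mathbf{Y})$ is nondecreasing in $i$, and a simple heuristic argument shows that if $\theta^{(i)}$ has a limit $\hat\theta$ then $\hat\theta$ must be a solution of the likelihood equations $\ell'(\hat\theta; \mathbf{Y}) = 0$. To see this, observe that $\ln f(\mathbf{x}, \mathbf{y}; \theta) = \ln f(\mathbf{x} \mid \mathbf{y}; \theta) + \ell(\theta; \mathbf{y})$, from which we obtain

$$Q(\theta \mid \theta^{(i)}) = \int\left(\ln f(\mathbf{x} \mid \mathbf{Y}; \theta)\right)f(\mathbf{x} \mid \mathbf{Y}; \theta^{(i)})\,d\mathbf{x} + \ell(\theta; \mathbf{Y})$$

and

$$Q'(\theta \mid \theta^{(i)}) = \int\left[\frac{\partial}{\partial\theta}f(\mathbf{x} \mid \mathbf{Y}; \theta)\right]\Big/ f(\mathbf{x} \mid \mathbf{Y}; \theta)\; f(\mathbf{x} \mid \mathbf{Y}; \theta^{(i)})\,d\mathbf{x} + \ell'(\theta; \mathbf{Y}).$$

Now replacing $\theta$ with $\theta^{(i+1)}$, noticing that $Q'(\theta^{(i+1)} \mid \theta^{(i)}) = 0$, and letting $i \to \infty$, we find that

$$0 = \int\frac{\partial}{\partial\theta}\left[f(\mathbf{x} \mid \mathbf{Y}; \theta)\right]_{\theta = \hat\theta}\,d\mathbf{x} + \ell'(\hat\theta; \mathbf{Y}) = \ell'(\hat\theta; \mathbf{Y}).$$

The last equality follows from the fact that

$$0 = \frac{\partial}{\partial\theta}(1) = \frac{\partial}{\partial\theta}\int f(\mathbf{x} \mid \mathbf{Y}; \theta)\,d\mathbf{x} = \int\frac{\partial}{\partial\theta}f(\mathbf{x} \mid \mathbf{Y}; \theta)\,d\mathbf{x}.$$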


The computational advantage of the EM algorithm over direct maximization of the 
likelihood is most pronounced when the calculation and maximization of the exact 
likelihood is difficult as compared with the maximization of Q in the M-step. (There 
are some applications in which the maximization of Q can easily be carried out 
explicitly.) 


Missing Data 


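The EM algorithm is particularly useful for estimation problems in which there are missing observations. Suppose the complete data set consists of $Y_1, \ldots, Y_n$ of which $r$ are observed and $n - r$ are missing. Denote the observed and missing data by $\mathbf{Y} = (Y_{i_1}, \ldots, Y_{i_r})'$ and $\mathbf{X} = (Y_{j_1}, \ldots, Y_{j_{n-r}})'$, respectively. Assuming that $\mathbf{W} = (\mathbf{X}', \mathbf{Y}')'$ has a multivariate normal distribution with mean 0 and covariance matrix $\Sigma$, which depends on the parameter $\theta$, the log-likelihood of the complete data is given by

$$\ell(\theta; \mathbf{W}) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln\det(\Sigma) - \frac{1}{2}\mathbf{W}'\Sigma^{-1}\mathbf{W}.$$

The E-step requires that we compute the expectation of $\ell(\theta; \mathbf{W})$ with respect to the conditional distribution of $\mathbf{W}$ given $\mathbf{Y}$ with $\theta = \theta^{(i)}$. Writing $\Sigma(\theta)$ as the block matrix

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$

which is conformable with $\mathbf{X}$ and $\mathbf{Y}$, the conditional distribution of $\mathbf{W}$ given $\mathbf{Y}$ is multivariate normal with mean $\begin{bmatrix} \hat{\mathbf{X}} \\ \mathbf{Y} \end{bmatrix}$ and covariance matrix $\begin{bmatrix} \Sigma_{11|2}(\theta) & 0 \\ 0 & 0 \end{bmatrix}$, where $\hat{\mathbf{X}} = E_\theta(\mathbf{X} \mid \mathbf{Y}) = \Sigma_{12}\Sigma_{22}^{-1}\mathbf{Y}$ and $\Sigma_{11|2}(\theta) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ (see Proposition A.3.1). Using Problem A.8, we have

$$E_{\theta^{(i)}}\left[(\mathbf{X}', \mathbf{Y}')\Sigma^{-1}(\theta)(\mathbf{X}', \mathbf{Y}')' \mid \mathbf{Y}\right] = \mathrm{trace}\left(\Sigma_{11|2}(\theta^{(i)})\Sigma_{11|2}^{-1}(\theta)\right) + \hat{\mathbf{W}}'\Sigma^{-1}(\theta)\hat{\mathbf{W}},$$

where $\hat{\mathbf{W}} = (\hat{\mathbf{X}}', \mathbf{Y}')'$. It follows that

$$Q(\theta \mid \theta^{(i)}) = \ell(\theta, \hat{\mathbf{W}}) - \frac{1}{2}\mathrm{trace}\left(\Sigma_{11|2}(\theta^{(i)})\Sigma_{11|2}^{-1}(\theta)\right).$$

The first term on the right is the log-likelihood based on the complete data, but with $\mathbf{X}$ replaced by its “best estimate” $\hat{\mathbf{X}}$ calculated from the previous iteration. If the increments $\theta^{(i+1)} - \theta^{(i)}$ are small, then the second term on the right is nearly constant ($\approx n - r$) and can be ignored. For ease of computation in this application we shall use the modified version

$$\tilde Q(\theta \mid \theta^{(i)}) = \ell(\theta; \hat{\mathbf{W}}).$$

With this adjustment, the steps in the EM algorithm are as follows:

E-step. Calculate $E_{\theta^{(i)}}(\mathbf{X} \mid \mathbf{Y})$ (e.g., with the Kalman fixed-point smoother) and form $\ell(\theta; \hat{\mathbf{W}})$.

M-step. Find the maximum likelihood estimator for the “complete” data problem, i.e., maximize $\ell(\theta; \hat{\mathbf{W}})$. For ARMA processes, ITSM can be used directly, with the missing values replaced with their best estimates computed in the E-step.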
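The alternation between the two steps can be sketched for a mean-zero AR(1) model with isolated interior missing values. This is a simplified illustration rather than the procedure carried out with ITSM: the M-step below uses a least-squares fit of $\phi$ as a stand-in for exact maximum likelihood, the E-step uses the closed-form estimate $\phi(Y_{t-1} + Y_{t+1})/(1 + \phi^2)$ from Section 8.6, and the function name and inputs are hypothetical.

```python
def em_ar1(y, missing, iters=20):
    """Modified EM iteration for a mean-zero AR(1) with missing values.

    y       : list of values; entries at the indices in `missing` are ignored
    missing : 0-based indices of missing values (assumed interior and non-adjacent)
    Returns (phi_hat, completed_series).
    """
    w = list(y)
    for t in missing:
        w[t] = 0.0            # iteration 0: phi = 0, so the best estimates are 0
    phi = 0.0
    for _ in range(iters):
        # M-step (approximation): least-squares regression of w[t] on w[t-1].
        num = sum(w[t] * w[t - 1] for t in range(1, len(w)))
        den = sum(w[t - 1] ** 2 for t in range(1, len(w)))
        phi = num / den
        # E-step: replace each missing value by its best estimate under phi.
        for t in missing:
            w[t] = phi * (w[t - 1] + w[t + 1]) / (1.0 + phi ** 2)
    return phi, w

phi_hat, completed = em_ar1([2.0, 1.0, None, 0.25, 0.125], [2])
print(phi_hat, completed[2])
```

A fixed number of iterations is used here for simplicity; in practice one would stop when the change in the parameter estimate falls below a chosen tolerance.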


Example 8.7.1 The lake data

It was found in Example 5.2.5 that the AR(2) model

$$W_t - 1.0415W_{t-1} + 0.2494W_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, .4790)$$

was a good fit to the mean-corrected lake data $\{W_t\}$. To illustrate the use of the EM algorithm for missing data, consider fitting an AR(2) model to the mean-corrected data assuming that there are 10 missing values at times $t$ = 17, 24, 31, 38, 45, 52, 59, 66, 73, and 80. We start the algorithm at iteration 0 with $\hat\phi_1^{(0)} = \hat\phi_2^{(0)} = 0$. Since this initial model represents white noise, the first E-step gives, in the notation used above, $\hat W_{17} = \cdots = \hat W_{80} = 0$. Replacing the “missing” values of the mean-corrected lake data with 0 and fitting a mean-zero AR(2) model to the resulting complete data set using the maximum likelihood option in ITSM, we find that $\hat\phi_1^{(1)} = .7252$, $\hat\phi_2^{(1)} = .0236$. (Examination of the plots of the ACF and PACF of this new data set suggests an AR(1) as a better model. This is also borne out by the small estimated value of $\phi_2$.) The updated missing values at times $t$ = 17, 24, ..., 80 are found (see Section 8.6 and Problem 8.21) by minimizing

$$\sum_{j=0}^{2}\left(W_{t+j} - \hat\phi_1^{(1)}W_{t+j-1} - \hat\phi_2^{(1)}W_{t+j-2}\right)^2$$

with respect to $W_t$. The solution is given by

$$\hat W_t = \frac{\hat\phi_2^{(1)}(W_{t-2} + W_{t+2}) + \left(\hat\phi_1^{(1)} - \hat\phi_1^{(1)}\hat\phi_2^{(1)}\right)(W_{t-1} + W_{t+1})}{1 + \left(\hat\phi_1^{(1)}\right)^2 + \left(\hat\phi_2^{(1)}\right)^2}.$$

The M-step of iteration 1 is then carried out by fitting an AR(2) model using ITSM applied to the updated data set. As seen in the summary of the results reported in Table 8.1, the EM algorithm converges in four iterations with the final parameter estimates reasonably close to the fitted model based on the complete data set. (In Table 8.1, estimates of the missing values are recorded only for the first three.) Also notice how $-2\ell(\theta^{(i)}, \hat{\mathbf{W}})$ decreases at every iteration. The standard errors of the parameter estimates produced from the last iteration of ITSM are based on a “complete” data set and, as such, underestimate the true sampling errors. Formulae for adjusting the standard errors to reflect the true sampling error based on the observed data can be found in Dempster, Laird, and Rubin (1977).

Table 8.1 Estimates of the missing observations at times t = 17, 24, 31 and the AR estimates using the EM algorithm in Example 8.7.1.

iteration i   W_17    W_24    W_31    phi_1^(i)   phi_2^(i)   -2 l(theta^(i), W)
0                                     0           0           322.60
1             0       0       0       .7252       .0236       244.76
2             .534    .205    .746    1.0729      -.2838      203.57
3             .458    .393    .821    1.0999      -.3128      202.25
4             .454    .405    .826    1.0999      -.3128      202.25
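The closed-form update for a single missing value in an AR(2) series can be checked against the sum of squares it is meant to minimize. The sketch below assumes the four neighbouring values are available; the function names and the numerical inputs are illustrative and are not the lake data.

```python
def ar2_missing_value(w_m2, w_m1, w_p1, w_p2, phi1, phi2):
    """Minimizer of the three squared AR(2) residuals that involve W_t.

    w_m2, w_m1 : W_{t-2}, W_{t-1};  w_p1, w_p2 : W_{t+1}, W_{t+2}
    """
    num = phi2 * (w_m2 + w_p2) + (phi1 - phi1 * phi2) * (w_m1 + w_p1)
    return num / (1.0 + phi1 ** 2 + phi2 ** 2)

def sum_of_squares(w_t, w_m2, w_m1, w_p1, w_p2, phi1, phi2):
    """The j = 0, 1, 2 terms of the sum minimized in the E-step."""
    return ((w_t - phi1 * w_m1 - phi2 * w_m2) ** 2
            + (w_p1 - phi1 * w_t - phi2 * w_m1) ** 2
            + (w_p2 - phi1 * w_p1 - phi2 * w_t) ** 2)

neighbours = (1.0, 2.0, 3.0, 4.0)          # W_{t-2}, W_{t-1}, W_{t+1}, W_{t+2}
phi1, phi2 = 1.0, -0.25
w_hat = ar2_missing_value(*neighbours, phi1, phi2)
print(w_hat, sum_of_squares(w_hat, *neighbours, phi1, phi2))
```

Perturbing `w_hat` in either direction increases the sum of squares, which is a simple way to confirm the formula without redoing the calculus.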


8.8 Generalized State-Space Models 


As in Section 8.1, we consider a sequence of state variables $\{X_t, t \ge 1\}$ and a sequence of observations $\{Y_t, t \ge 1\}$. For simplicity, we consider only one-dimensional state and observation variables, since extensions to higher dimensions can be carried out with little change. Throughout this section it will be convenient to write $\mathbf{Y}^{(t)}$ and $\mathbf{X}^{(t)}$ for the $t$-dimensional column vectors $\mathbf{Y}^{(t)} = (Y_1, Y_2, \ldots, Y_t)'$ and $\mathbf{X}^{(t)} = (X_1, X_2, \ldots, X_t)'$.

There are two important types of state-space models, “parameter driven” and 
“observation driven,” both of which are frequently used in time series analysis. The 
observation equation is the same for both, but the state vectors of a parameter-driven 
model evolve independently of the past history of the observation process, while the 
state vectors of an observation-driven model depend on past observations. 


8.8.1 Parameter-Driven Models 


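In place of the observation and state equations (8.1.1) and (8.1.2), we now make the assumptions that $Y_t$ given $(X_t, \mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$ is independent of $(\mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$ with conditional probability density

$$p(y_t \mid x_t) := p(y_t \mid x_t, \mathbf{x}^{(t-1)}, \mathbf{y}^{(t-1)}), \quad t = 1, 2, \ldots, \tag{8.8.1}$$

and that $X_{t+1}$ given $(X_t, \mathbf{X}^{(t-1)}, \mathbf{Y}^{(t)})$ is independent of $(\mathbf{X}^{(t-1)}, \mathbf{Y}^{(t)})$ with conditional density function

$$p(x_{t+1} \mid x_t) := p(x_{t+1} \mid x_t, \mathbf{x}^{(t-1)}, \mathbf{y}^{(t)}), \quad t = 1, 2, \ldots. \tag{8.8.2}$$

We shall also assume that the initial state $X_1$ has probability density $p_1$. The joint density of the observation and state variables can be computed directly from (8.8.1)-(8.8.2) as

$$p(y_1, \ldots, y_n, x_1, \ldots, x_n) = p(y_n \mid x_n, \mathbf{x}^{(n-1)}, \mathbf{y}^{(n-1)})\,p(x_n, \mathbf{x}^{(n-1)}, \mathbf{y}^{(n-1)})$$

$$= p(y_n \mid x_n)\,p(x_n \mid \mathbf{x}^{(n-1)}, \mathbf{y}^{(n-1)})\,p(\mathbf{y}^{(n-1)}, \mathbf{x}^{(n-1)})$$

$$= p(y_n \mid x_n)\,p(x_n \mid x_{n-1})\,p(\mathbf{y}^{(n-1)}, \mathbf{x}^{(n-1)})$$

$$= \cdots = \left(\prod_{j=1}^{n}p(y_j \mid x_j)\right)\left(\prod_{j=2}^{n}p(x_j \mid x_{j-1})\right)p_1(x_1),$$

and since (8.8.2) implies that $\{X_t\}$ is Markov (see Problem 8.22),

$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{j=1}^{n}p(y_j \mid x_j). \tag{8.8.3}$$

We conclude that $Y_1, \ldots, Y_n$ are conditionally independent given the state variables $X_1, \ldots, X_n$, so that the dependence structure of $\{Y_t\}$ is inherited from that of the state process $\{X_t\}$. The sequence of state variables $\{X_t\}$ is often referred to as the hidden or latent generating process associated with the observed process.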

In order to solve the filtering and prediction problems in this setting, we shall
determine the conditional densities $p(x_t \mid y^{(t)})$ of $X_t$ given $Y^{(t)}$, and
$p(x_t \mid y^{(t-1)})$ of $X_t$ given $Y^{(t-1)}$, respectively. The minimum mean squared error
estimates of $X_t$ based on $Y^{(t)}$ and $Y^{(t-1)}$ can then be computed as the conditional
expectations $E(X_t \mid Y^{(t)})$ and $E(X_t \mid Y^{(t-1)})$.

An application of Bayes’s theorem, using the assumption that the distribution of
$Y_t$ given $(X_t, X^{(t-1)}, Y^{(t-1)})$ does not depend on $(X^{(t-1)}, Y^{(t-1)})$, yields

$$p(x_t \mid y^{(t)}) = p(y_t \mid x_t)\, p(x_t \mid y^{(t-1)}) / p(y_t \mid y^{(t-1)}) \tag{8.8.4}$$

and

$$p(x_{t+1} \mid y^{(t)}) = \int p(x_t \mid y^{(t)})\, p(x_{t+1} \mid x_t)\, d\mu(x_t). \tag{8.8.5}$$

(The integral relative to $d\mu(x_t)$ in (8.8.5) is interpreted as the integral relative to $dx_t$
in the continuous case and as the sum over all values of $x_t$ in the discrete case.) The
initial condition needed to solve these recursions is

$$p(x_1 \mid y^{(0)}) := p_1(x_1). \tag{8.8.6}$$
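The recursions (8.8.4)–(8.8.6) can be traced step by step when the state takes only finitely many values, so that the integral in (8.8.5) is a sum. The following is a minimal sketch under that assumption; the function name `bayes_filter`, the two-state transition matrix, and the emission table are hypothetical choices made for illustration and do not come from the text.

```python
from typing import Callable, Sequence


def bayes_filter(
    ys: Sequence[int],
    prior: Sequence[float],
    transition: Sequence[Sequence[float]],
    obs_density: Callable[[int, int], float],
):
    """Run (8.8.4)-(8.8.5) on a finite state space {0, ..., K-1}.

    prior[i]          = p_1(x_1 = i), the initial condition (8.8.6)
    transition[i][j]  = p(x_{t+1} = j | x_t = i)
    obs_density(y, i) = p(y_t = y | x_t = i)
    Returns the lists of filtering and one-step prediction densities.
    """
    k = len(prior)
    pred = list(prior)                        # p(x_1 | y^(0)) := p_1(x_1)
    filtered, predicted = [], []
    for y in ys:
        # (8.8.4): posterior proportional to p(y_t | x_t) p(x_t | y^(t-1))
        unnorm = [obs_density(y, i) * pred[i] for i in range(k)]
        scale = sum(unnorm)                   # this is p(y_t | y^(t-1))
        post = [u / scale for u in unnorm]
        # (8.8.5): sum over x_t of p(x_t | y^(t)) p(x_{t+1} | x_t)
        pred = [sum(post[i] * transition[i][j] for i in range(k)) for j in range(k)]
        filtered.append(post)
        predicted.append(pred)
    return filtered, predicted


# Hypothetical two-state illustration: state 0 = "low", state 1 = "high".
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]               # emit[i][y] = p(y | x = i), y in {0, 1}
filt, pred = bayes_filter([1, 1, 0], prior, transition, lambda y, i: emit[i][y])
```

The normalizing sum computed inside the loop is the one-step forecast density of the observation, so the same pass also yields the likelihood factors.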
The factor $p(y_t \mid y^{(t-1)})$ appearing in the denominator of (8.8.4) is just a scale factor,
determined by the condition $\int p(x_t \mid y^{(t)})\, d\mu(x_t) = 1$. In the generalized state-space
setup, prediction of a future state variable is less important than forecasting a future
value of the observations. The relevant forecast density can be computed from (8.8.5)
as

$$p(y_{t+1} \mid y^{(t)}) = \int p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y^{(t)})\, d\mu(x_{t+1}). \tag{8.8.7}$$

Equations (8.8.1)–(8.8.2) can be regarded as a Bayesian model specification. A
classical Bayesian model has two key assumptions. The first is that the data $Y_1, \ldots, Y_t$,
given an unobservable parameter ($X^{(t)}$ in our case), are independent with specified
conditional distribution. This corresponds to (8.8.3). The second specifies a prior
distribution for the parameter value. This corresponds to (8.8.2). The posterior
distribution is then the conditional distribution of the parameter given the data. In
the present setting the posterior distribution of the component $X_t$ of $X^{(t)}$ is determined
by the solution (8.8.4) of the filtering problem.


Example 8.8.1


Consider the simplified version of the linear state-space model of Section 8.1,

$$Y_t = G X_t + W_t, \quad \{W_t\} \sim \text{iid } N(0, R), \tag{8.8.8}$$

$$X_{t+1} = F X_t + V_t, \quad \{V_t\} \sim \text{iid } N(0, Q), \tag{8.8.9}$$


where the noise sequences $\{W_t\}$ and $\{V_t\}$ are independent of each other. For this model
the probability densities in (8.8.1)–(8.8.2) become

$$p_1(x_1) = n(x_1; EX_1, \mathrm{Var}(X_1)), \tag{8.8.10}$$
$$p(y_t \mid x_t) = n(y_t; G x_t, R), \tag{8.8.11}$$
$$p(x_{t+1} \mid x_t) = n(x_{t+1}; F x_t, Q), \tag{8.8.12}$$

where $n(x; \mu, \sigma^2)$ is the normal density with mean $\mu$ and variance $\sigma^2$ defined in
Example (a) of Section A.1.

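To solve the filtering and prediction problems in this new framework, we first
observe that the filtering and prediction densities in (8.8.4) and (8.8.5) are both normal.
We shall write them, using the notation of Section 8.4, as

$$p(x_t \mid Y^{(t)}) = n(x_t; X_{t|t}, \Omega_{t|t}) \tag{8.8.13}$$

and

$$p(x_{t+1} \mid Y^{(t)}) = n(x_{t+1}; \hat X_{t+1}, \Omega_{t+1}). \tag{8.8.14}$$

From (8.8.5), (8.8.12), (8.8.13), and (8.8.14), we find that

$$\begin{aligned}
\hat X_{t+1} &= \int_{-\infty}^{\infty} x_{t+1}\, p(x_{t+1} \mid Y^{(t)})\, dx_{t+1} \\
&= \int_{-\infty}^{\infty} x_{t+1} \int_{-\infty}^{\infty} p(x_t \mid Y^{(t)})\, p(x_{t+1} \mid x_t)\, dx_t\, dx_{t+1} \\
&= \int_{-\infty}^{\infty} p(x_t \mid Y^{(t)}) \Bigl[\int_{-\infty}^{\infty} x_{t+1}\, p(x_{t+1} \mid x_t)\, dx_{t+1}\Bigr] dx_t \\
&= \int_{-\infty}^{\infty} F x_t\, p(x_t \mid Y^{(t)})\, dx_t \\
&= F X_{t|t}
\end{aligned}$$

and (see Problem 8.23)

$$\Omega_{t+1} = F^2 \Omega_{t|t} + Q.$$

Substituting the corresponding densities (8.8.11) and (8.8.14) into (8.8.4), we find by
equating the coefficient of $x_t^2$ on both sides of (8.8.4) that

$$\Omega_{t|t}^{-1} = G^2 R^{-1} + \Omega_t^{-1} = G^2 R^{-1} + (F^2 \Omega_{t-1|t-1} + Q)^{-1}$$

and

$$X_{t|t} = \hat X_t + \Omega_{t|t}\, G R^{-1} (Y_t - G \hat X_t).$$

Also, from (8.8.4) with $p(x_1 \mid y^{(0)}) = n(x_1; EX_1, \Omega_1)$ we obtain the initial conditions

$$X_{1|1} = EX_1 + \Omega_{1|1}\, G R^{-1} (Y_1 - G\, EX_1)$$

and

$$\Omega_{1|1}^{-1} = G^2 R^{-1} + \Omega_1^{-1}.$$

The Kalman prediction and filtering recursions of Section 8.4 give the same results
for $\hat X_t$ and $X_{t|t}$, since for Gaussian systems best linear mean square estimation is
equivalent to best mean square estimation.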
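The agreement between the recursions of this example and the usual Kalman form can be checked numerically. The sketch below is illustrative only: it codes the scalar recursions once in the precision form derived here and once with an explicit Kalman gain, and compares them on hypothetical observations. The function names and all parameter values are assumptions made for the illustration.

```python
def precision_form_filter(ys, F, G, Q, R, mean1, var1):
    """Scalar filtering recursions of Example 8.8.1.

    mean1 and var1 play the roles of EX_1 and Omega_1 = Var(X_1).
    Returns a list of pairs (X_{t|t}, Omega_{t|t}).
    """
    x_pred, om_pred = mean1, var1
    out = []
    for y in ys:
        om_filt = 1.0 / (G * G / R + 1.0 / om_pred)          # Omega_{t|t}
        x_filt = x_pred + om_filt * G / R * (y - G * x_pred)  # X_{t|t}
        out.append((x_filt, om_filt))
        x_pred = F * x_filt                                   # hat X_{t+1}
        om_pred = F * F * om_filt + Q                         # Omega_{t+1}
    return out


def gain_form_filter(ys, F, G, Q, R, mean1, var1):
    """The same filter written with an explicit Kalman gain, for comparison."""
    x_pred, om_pred = mean1, var1
    out = []
    for y in ys:
        gain = om_pred * G / (G * G * om_pred + R)
        x_filt = x_pred + gain * (y - G * x_pred)
        om_filt = om_pred - gain * G * om_pred
        out.append((x_filt, om_filt))
        x_pred, om_pred = F * x_filt, F * F * om_filt + Q
    return out


obs = [0.3, -1.2, 0.8, 2.1]                                   # hypothetical data
res_precision = precision_form_filter(obs, F=0.9, G=1.5, Q=0.5, R=2.0, mean1=0.0, var1=1.0)
res_gain = gain_form_filter(obs, F=0.9, G=1.5, Q=0.5, R=2.0, mean1=0.0, var1=1.0)
```

The two forms agree because $\Omega_{t|t} G R^{-1} = \Omega_t G / (G^2 \Omega_t + R)$, which follows from the expression for $\Omega_{t|t}^{-1}$.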
Example 8.8.2 A non-Gaussian example 


In general, the solution of the recursions (8.8.4) and (8.8.5) presents substantial
computational problems. Numerical methods for dealing with non-Gaussian models are
discussed by Sorenson and Alspach (1971) and Kitagawa (1987). Here we shall
illustrate the recursions (8.8.4) and (8.8.5) in a very simple special case. Consider the
state equation

$$X_t = a X_{t-1}, \tag{8.8.15}$$

with observation density

$$p(y_t \mid x_t) = \frac{(\pi x_t)^{y_t} e^{-\pi x_t}}{y_t!}, \quad y_t = 0, 1, \ldots, \tag{8.8.16}$$

where $\pi$ is a constant between 0 and 1. The relationship in (8.8.15) implies that the
transition density (in the discrete sense; see the comment after (8.8.5)) for the state
variables is

$$p(x_{t+1} \mid x_t) = \begin{cases} 1, & \text{if } x_{t+1} = a x_t, \\ 0, & \text{otherwise.} \end{cases}$$

We shall assume that $X_1$ has the gamma density function

$$p_1(x_1) = g(x_1; \alpha, \lambda) = \frac{\lambda^{\alpha} x_1^{\alpha - 1} e^{-\lambda x_1}}{\Gamma(\alpha)}, \quad x_1 > 0.$$

(This is a simplified model for the evolution of the number $X_t$ of individuals at
time $t$ infected with a rare disease, in which $X_t$ is treated as a continuous rather
than an integer-valued random variable. The observation $Y_t$ represents the number of
infected individuals observed in a random sample consisting of a small fraction $\pi$ of
the population at time $t$.) Because the transition distribution of $\{X_t\}$ is not continuous,
we use the integrated version of (8.8.5) to compute the prediction density. Thus,

$$\begin{aligned}
P(X_t \le x \mid y^{(t-1)}) &= \int_0^{\infty} P(X_t \le x \mid x_{t-1})\, p(x_{t-1} \mid y^{(t-1)})\, dx_{t-1} \\
&= \int_0^{x/a} p(x_{t-1} \mid y^{(t-1)})\, dx_{t-1}.
\end{aligned}$$

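Differentiation with respect to $x$ gives

$$p(x_t \mid y^{(t-1)}) = a^{-1}\, p_{X_{t-1} \mid Y^{(t-1)}}(a^{-1} x_t \mid y^{(t-1)}). \tag{8.8.17}$$

Now applying (8.8.4), we find that

$$\begin{aligned}
p(x_1 \mid y_1) &= p(y_1 \mid x_1)\, p_1(x_1) / p(y_1) \\
&= \Bigl(\frac{(\pi x_1)^{y_1} e^{-\pi x_1}}{y_1!}\Bigr) \Bigl(\frac{\lambda^{\alpha} x_1^{\alpha - 1} e^{-\lambda x_1}}{\Gamma(\alpha)}\Bigr) \Bigl(\frac{1}{p(y_1)}\Bigr) \\
&= c(y_1)\, x_1^{\alpha + y_1 - 1} e^{-(\pi + \lambda) x_1}, \quad x_1 > 0,
\end{aligned}$$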


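where $c(y_1)$ is an integration factor ensuring that $p(\cdot \mid y_1)$ integrates to 1. Since
$p(\cdot \mid y_1)$ has the form of a gamma density, we deduce (see Example (d) of Section A.1) that

$$p(x_1 \mid y_1) = g(x_1; \alpha_1, \lambda_1), \tag{8.8.18}$$

where $\alpha_1 = \alpha + y_1$ and $\lambda_1 = \lambda + \pi$. The prediction density, calculated from (8.8.5)
and (8.8.18), is

$$\begin{aligned}
p(x_2 \mid y^{(1)}) &= a^{-1}\, p_{X_1 \mid Y^{(1)}}(a^{-1} x_2 \mid y^{(1)}) \\
&= a^{-1} g(a^{-1} x_2; \alpha_1, \lambda_1) \\
&= g(x_2; \alpha_1, \lambda_1 / a).
\end{aligned}$$

Iterating the recursions (8.8.4) and (8.8.5) and using (8.8.17), we find that for $t \ge 1$,

$$p(x_t \mid y^{(t)}) = g(x_t; \alpha_t, \lambda_t) \tag{8.8.19}$$

and

$$\begin{aligned}
p(x_{t+1} \mid y^{(t)}) &= a^{-1} g(a^{-1} x_{t+1}; \alpha_t, \lambda_t) \\
&= g(x_{t+1}; \alpha_t, \lambda_t / a),
\end{aligned} \tag{8.8.20}$$

where $\alpha_t = \alpha_{t-1} + y_t = \alpha + y_1 + \cdots + y_t$ and
$\lambda_t = \lambda_{t-1}/a + \pi = \lambda a^{1-t} + \pi (1 - a^{-t}) / (1 - a^{-1})$. In particular, the
minimum mean squared error estimate of $x_t$ based on $y^{(t)}$ is the conditional expectation
$\alpha_t / \lambda_t$ with conditional variance $\alpha_t / \lambda_t^2$.
From (8.8.7) the probability density of $Y_{t+1}$ given $Y^{(t)}$ is

$$\begin{aligned}
p(y_{t+1} \mid y^{(t)}) &= \int_0^{\infty} \Bigl(\frac{(\pi x_{t+1})^{y_{t+1}} e^{-\pi x_{t+1}}}{y_{t+1}!}\Bigr)\, g(x_{t+1}; \alpha_t, \lambda_t / a)\, dx_{t+1} \\
&= \frac{\Gamma(\alpha_t + y_{t+1})}{\Gamma(\alpha_t)\, \Gamma(y_{t+1} + 1)} \Bigl(1 - \frac{\pi}{\lambda_{t+1}}\Bigr)^{\alpha_t} \Bigl(\frac{\pi}{\lambda_{t+1}}\Bigr)^{y_{t+1}} \\
&= nb(y_{t+1}; \alpha_t, 1 - \pi / \lambda_{t+1}), \quad y_{t+1} = 0, 1, \ldots,
\end{aligned}$$

where $nb(y; \alpha, p)$ is the negative binomial density defined in example (i) of
Section A.1. Conditional on $Y^{(t)}$, the best one-step predictor of $Y_{t+1}$ is therefore the
mean, $\alpha_t \pi / (\lambda_{t+1} - \pi)$, of this negative binomial distribution. The conditional mean
squared error of the predictor is
$\mathrm{Var}(Y_{t+1} \mid Y^{(t)}) = \alpha_t \pi \lambda_{t+1} / (\lambda_{t+1} - \pi)^2$ (see Problem 8.25).


Example 8.8.3 A model for time series of counts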


We often encounter time series in which the observations represent count data. One
such example is the monthly number of newly recorded cases of poliomyelitis in the
U.S. for the years 1970-1983 plotted in Figure 8.6. Unless the actual counts are large
and can be approximated by continuous variables, Gaussian and linear time series
models are generally inappropriate for analyzing such data. The parameter-driven
specification provides a flexible class of models for modeling count data. We now
discuss a specific model based on a Poisson observation density. This model is similar
to the one presented by Zeger (1988) for analyzing the polio data. The observation
density is assumed to be Poisson with mean $\exp\{x_t\}$, i.e.,

$$p(y_t \mid x_t) = \frac{e^{-e^{x_t}} e^{y_t x_t}}{y_t!}, \quad y_t = 0, 1, \ldots, \tag{8.8.21}$$

while the state variables are assumed to follow a regression model with Gaussian
AR(1) noise. If $u_t = (u_{t1}, \ldots, u_{tk})'$ are the regression variables, then

$$X_t = \beta' u_t + W_t, \tag{8.8.22}$$

where $\beta$ is a $k$-dimensional regression parameter and

$$W_t = \phi W_{t-1} + Z_t, \quad \{Z_t\} \sim \text{IID } N(0, \sigma^2).$$

The transition density function for the state variables is then

$$p(x_{t+1} \mid x_t) = n(x_{t+1}; \beta' u_{t+1} + \phi (x_t - \beta' u_t), \sigma^2). \tag{8.8.23}$$

The case $\sigma^2 = 0$ corresponds to a log-linear model with Poisson noise.

Figure 8-6 Monthly number of U.S. cases of polio, Jan. '70 to Dec. '83.

Estimation of the parameters $\theta = (\beta', \phi, \sigma^2)'$ in the model by direct numerical
maximization of the likelihood function is difficult, since the likelihood cannot be
written down in closed form. (From (8.8.3) the likelihood is the $n$-fold integral,

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\Bigl\{\sum_{t=1}^{n} (x_t y_t - e^{x_t})\Bigr\}\, L(\theta; x^{(n)})\, (dx_1 \cdots dx_n) \Big/ \prod_{i=1}^{n} (y_i!),$$

where $L(\theta; x)$ is the likelihood based on $X_1, \ldots, X_n$.) To overcome this difficulty,
Chan and Ledolter (1995) proposed an algorithm, called Monte Carlo EM (MCEM),
whose iterates $\theta^{(i)}$ converge to the maximum likelihood estimate. To apply this
algorithm, first note that the conditional distribution of $Y^{(n)}$ given $X^{(n)}$ does not depend
on $\theta$, so that the likelihood based on the complete data $(X^{(n)\prime}, Y^{(n)\prime})'$ is given by

$$L(\theta; X^{(n)}, Y^{(n)}) = f(Y^{(n)} \mid X^{(n)})\, L(\theta; X^{(n)}).$$

The E-step of the algorithm (see Section 8.7) requires calculation of

$$\begin{aligned}
Q(\theta \mid \theta^{(i)}) &= E_{\theta^{(i)}}\bigl(\ln L(\theta; X^{(n)}, Y^{(n)}) \mid Y^{(n)}\bigr) \\
&= E_{\theta^{(i)}}\bigl(\ln f(Y^{(n)} \mid X^{(n)}) \mid Y^{(n)}\bigr) + E_{\theta^{(i)}}\bigl(\ln L(\theta; X^{(n)}) \mid Y^{(n)}\bigr).
\end{aligned}$$


We delete the first term from the definition of $Q$, since it is independent of $\theta$ and
hence plays no role in the M-step of the EM algorithm. The new $Q$ is redefined as

$$Q(\theta \mid \theta^{(i)}) = E_{\theta^{(i)}}\bigl(\ln L(\theta; X^{(n)}) \mid Y^{(n)}\bigr). \tag{8.8.24}$$

Even with this simplification, direct calculation of $Q$ is still intractable. Suppose
for the moment that it is possible to generate replicates of $X^{(n)}$ from the conditional
distribution of $X^{(n)}$ given $Y^{(n)}$ when $\theta = \theta^{(i)}$. If we denote $m$ independent replicates of
$X^{(n)}$ by $X_1^{(n)}, \ldots, X_m^{(n)}$, then a Monte Carlo approximation to $Q$ in (8.8.24) is given by

$$Q_m(\theta \mid \theta^{(i)}) = \frac{1}{m} \sum_{j=1}^{m} \ln L(\theta; X_j^{(n)}).$$

The M-step is easy to carry out using $Q_m$ in place of $Q$ (especially if we condition
on $X_1 = 0$ in all the simulated replicates), since $L$ is just the Gaussian likelihood of
the regression model with AR(1) noise treated in Section 6.6. The difficult steps in
the algorithm are the generation of replicates of $X^{(n)}$ given $Y^{(n)}$ and the choice of $m$.
Chan and Ledolter (1995) discuss the use of the Gibbs sampler for generating the
desired replicates and give some guidelines on the choice of m.
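Once replicates are available, the Monte Carlo approximation $Q_m$ is just an average of Gaussian log-likelihoods. The sketch below is a hedged illustration of that single step: it assumes the replicates have already been generated by some sampler (not shown) and evaluates the likelihood conditionally on the first state value, using the transition density (8.8.23). The function names and the toy inputs are hypothetical.

```python
import math


def log_lik_ar1_regression(x, u, beta, phi, sigma2):
    """ln L(theta; x) conditional on x[0], built from the transition density (8.8.23).

    x is one state sequence, u[t] the regression vector at time t.
    """
    mean = [sum(b * c for b, c in zip(beta, row)) for row in u]   # beta' u_t
    total = 0.0
    for t in range(1, len(x)):
        resid = x[t] - mean[t] - phi * (x[t - 1] - mean[t - 1])
        total += -0.5 * math.log(2 * math.pi * sigma2) - resid * resid / (2 * sigma2)
    return total


def q_m(replicates, u, beta, phi, sigma2):
    """Monte Carlo approximation Q_m: average of ln L over the m replicates."""
    values = [log_lik_ar1_regression(x, u, beta, phi, sigma2) for x in replicates]
    return sum(values) / len(values)


# Hypothetical replicates of a length-4 state sequence with an intercept-only regression.
u_toy = [[1.0], [1.0], [1.0], [1.0]]
replicates_toy = [[0.0, 0.2, -0.1, 0.4], [0.0, -0.3, 0.1, 0.0], [0.0, 0.5, 0.6, 0.2]]
q_value = q_m(replicates_toy, u_toy, beta=[0.1], phi=0.5, sigma2=0.3)
```

In an M-step this function would be maximized over `beta`, `phi`, and `sigma2`; that optimization is omitted here.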

In their analyses of the polio data, Zeger (1988) and Chan and Ledolter (1995)
included as regression components an intercept, a slope, and harmonics at periods of
6 and 12 months. Specifically, they took

$$u_t = (1, t/1000, \cos(2\pi t/12), \sin(2\pi t/12), \cos(2\pi t/6), \sin(2\pi t/6))'.$$

The implementation of Chan and Ledolter’s MCEM method by Kuk and Cheng
(1994) gave estimates $\hat\beta = (.247, -3.871, .162, -.482, .414, -.011)'$, $\hat\phi = .648$, and
$\hat\sigma^2 = .281$. The estimated trend function $\hat\beta' u_t$ is displayed in Figure 8.7. The negative
coefficient of $t/1000$ indicates a slight downward trend in the monthly number of
polio cases.
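The estimated trend function can be evaluated directly from the quoted coefficients. The following sketch does so for 168 months; the indexing convention ($t = 1$ for January 1970) is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

# Estimates quoted in the text (Kuk and Cheng, 1994).
beta_hat = [0.247, -3.871, 0.162, -0.482, 0.414, -0.011]


def u(t):
    """Regression vector u_t: intercept, t/1000, harmonics at periods 12 and 6."""
    return [
        1.0,
        t / 1000.0,
        math.cos(2 * math.pi * t / 12), math.sin(2 * math.pi * t / 12),
        math.cos(2 * math.pi * t / 6), math.sin(2 * math.pi * t / 6),
    ]


def trend(t):
    """Estimated trend beta_hat' u_t, on the scale of the state X_t (log Poisson mean)."""
    return sum(b * c for b, c in zip(beta_hat, u(t)))


# Assumed indexing: t = 1, ..., 168 covers Jan. 1970 to Dec. 1983.
trend_values = [trend(t) for t in range(1, 169)]
```

Because the harmonic terms repeat every 12 months, the change in the trend over one year is $-3.871 \times 12/1000$, which is the slight downward drift mentioned above.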


Figure 8-7 Trend estimate for the monthly number of U.S. cases of polio, Jan. '70 to Dec. '83.


8.8.2 Observation-Driven Models


In an observation-driven model it is again assumed that $Y_t$, conditional on
$(X_t, X^{(t-1)}, Y^{(t-1)})$, is independent of $(X^{(t-1)}, Y^{(t-1)})$. The model is specified by the
conditional densities

$$p(y_t \mid x_t) = p(y_t \mid x^{(t)}, y^{(t-1)}), \quad t = 1, 2, \ldots, \tag{8.8.25}$$
$$p(x_{t+1} \mid y^{(t)}) = p_{X_{t+1} \mid Y^{(t)}}(x_{t+1} \mid y^{(t)}), \quad t = 0, 1, \ldots, \tag{8.8.26}$$

where $p(x_1 \mid y^{(0)}) := p_1(x_1)$ for some prespecified initial density $p_1(x_1)$. The
advantage of the observation-driven state equation (8.8.26) is that the posterior distribution
of $X_t$ given $Y^{(t)}$ can be computed directly from (8.8.4) without the use of the
updating formula (8.8.5). This then allows for easy computation of the forecast function
in (8.8.7) and hence of the joint density function of $(Y_1, \ldots, Y_n)'$,

$$p(y_1, \ldots, y_n) = \prod_{t=1}^{n} p(y_t \mid y^{(t-1)}). \tag{8.8.27}$$

On the other hand, the mechanism by which the state $X_{t-1}$ makes the transition to
$X_t$ is not explicitly defined. In fact, without further assumptions there may be state
sequences $\{X_t\}$ and $\{X_t^*\}$ with different distributions for which both (8.8.25) and
(8.8.26) hold (see Example 8.8.5). Both sequences, however, lead to the same joint
distribution, given by (8.8.27), for $Y_1, \ldots, Y_n$. The ambiguity in the specification of
the distribution of the state variables can be removed by assuming that $X_{t+1}$ given
$(X^{(t)}, Y^{(t)})$ is independent of $X^{(t)}$, with conditional distribution (8.8.26), i.e.,

$$p(x_{t+1} \mid x^{(t)}, y^{(t)}) = p_{X_{t+1} \mid Y^{(t)}}(x_{t+1} \mid y^{(t)}). \tag{8.8.28}$$

With this modification, the joint density of $Y^{(n)}$ and $X^{(n)}$ is given by (cf. (8.8.3))

$$\begin{aligned}
p(y^{(n)}, x^{(n)}) &= p(y_n \mid x_n)\, p(x_n \mid y^{(n-1)})\, p(y^{(n-1)}, x^{(n-1)}) \\
&= \cdots \\
&= \prod_{t=1}^{n} \bigl(p(y_t \mid x_t)\, p(x_t \mid y^{(t-1)})\bigr).
\end{aligned}$$


Example 8.8.4 An AR(1) process

An AR(1) process with iid noise can be expressed as an observation-driven model.
Suppose $\{Y_t\}$ is the AR(1) process

$$Y_t = \phi Y_{t-1} + Z_t,$$

where $\{Z_t\}$ is an iid sequence of random variables with mean 0 and some probability
density function $f(x)$. Then with $X_t := Y_{t-1}$ we have

$$p(y_t \mid x_t) = f(y_t - \phi x_t)$$

and

$$p(x_{t+1} \mid y^{(t)}) = \begin{cases} 1, & \text{if } x_{t+1} = y_t, \\ 0, & \text{otherwise.} \end{cases}$$
Example 8.8.5

Suppose the observation-equation density is given by

$$p(y_t \mid x_t) = \frac{x_t^{y_t} e^{-x_t}}{y_t!}, \quad y_t = 0, 1, \ldots, \tag{8.8.29}$$

and the state equation (8.8.26) is

$$p(x_{t+1} \mid y^{(t)}) = g(x_{t+1}; \alpha_t, \lambda_t), \tag{8.8.30}$$

where $\alpha_t = \alpha + y_1 + \cdots + y_t$ and $\lambda_t = \lambda + t$. It is possible to give a
parameter-driven specification that gives rise to the same state equation (8.8.30). Let $\{X_t^*\}$ be the
parameter-driven state variables, where $X_t^* = X_{t-1}^*$ and $X_1^*$ has a gamma distribution
with parameters $\alpha$ and $\lambda$. (This corresponds to the model in Example 8.8.2 with
$\pi = a = 1$.) Then from (8.8.19) we see that $p(x_t^* \mid y^{(t)}) = g(x_t^*; \alpha_t, \lambda_t)$, which
coincides with the state equation (8.8.30). If $\{X_t\}$ are the state variables whose joint
distribution is specified through (8.8.28), then $\{X_t\}$ and $\{X_t^*\}$ cannot have the same
joint distributions. To see this, note that

$$p(x_{t+1}^* \mid x_t^*) = \begin{cases} 1, & \text{if } x_{t+1}^* = x_t^*, \\ 0, & \text{otherwise,} \end{cases}$$

while

$$p(x_{t+1} \mid x^{(t)}, y^{(t)}) = p(x_{t+1} \mid y^{(t)}) = g(x_{t+1}; \alpha_t, \lambda_t).$$

If the two sequences had the same joint distribution, then the latter density could take
only the values 0 and 1, which contradicts the continuity (as a function of $x_t$) of this
density.
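The gamma parameters appearing in (8.8.19) and (8.8.30) can be generated by a two-line recursion. The sketch below is an illustration under the assumptions of Example 8.8.2 (shape updated by $y_t$, rate updated by $\pi$ and rescaled by $a$ at each transition); with the defaults $\pi = a = 1$ it reproduces $\alpha_t = \alpha + y_1 + \cdots + y_t$ and $\lambda_t = \lambda + t$. The function name and the input counts are hypothetical.

```python
def gamma_poisson_filter(ys, alpha, lam, pi=1.0, a=1.0):
    """Return the posterior gamma parameters (alpha_t, lambda_t) for t = 1, ..., n."""
    params = []
    prior_shape, prior_rate = alpha, lam          # X_1 has density g(.; alpha, lambda)
    for y in ys:
        post_shape = prior_shape + y              # alpha_t = alpha_{t-1} + y_t
        post_rate = prior_rate + pi               # lambda_t = lambda_{t-1}/a + pi
        params.append((post_shape, post_rate))
        # The transition X_{t+1} = a X_t rescales the rate: g(.; alpha_t, lambda_t / a).
        prior_shape, prior_rate = post_shape, post_rate / a
    return params


counts = [2, 0, 3]                                # hypothetical observations
same_state = gamma_poisson_filter(counts, alpha=1.5, lam=2.0)            # pi = a = 1
scaled = gamma_poisson_filter(counts, alpha=1.5, lam=2.0, pi=0.5, a=2.0)
```

For the second call the final rate can be compared with the closed form $\lambda a^{1-t} + \pi (1 - a^{-t})/(1 - a^{-1})$ at $t = 3$.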


Exponential Family Models 


The exponential family of distributions provides a large and flexible class of
distributions for use in the observation equation. The density in the observation equation
is said to belong to an exponential family (in natural parameterization) if

$$p(y_t \mid x_t) = \exp\{y_t x_t - b(x_t) + c(y_t)\}, \tag{8.8.31}$$

where $b(\cdot)$ is a twice continuously differentiable function and $c(y_t)$ does not depend
on $x_t$. This family includes the normal, exponential, gamma, Poisson, binomial, and
many other distributions frequently encountered in statistics. Detailed properties of
the exponential family can be found in Barndorff-Nielsen (1978), and an excellent
treatment of its use in the analysis of linear models is given by McCullagh and Nelder
(1989). We shall need only the following important facts:

$$e^{b(x_t)} = \int \exp\{y_t x_t + c(y_t)\}\, \nu(dy_t), \tag{8.8.32}$$
$$b'(x_t) = E(Y_t \mid x_t), \tag{8.8.33}$$
$$b''(x_t) = \mathrm{Var}(Y_t \mid x_t) := \int y_t^2\, p(y_t \mid x_t)\, \nu(dy_t) - [b'(x_t)]^2, \tag{8.8.34}$$

where integration with respect to $\nu(dy_t)$ means integration with respect to $dy_t$ in the
continuous case and summation over all values of $y_t$ in the discrete case.
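The three facts (8.8.32)–(8.8.34) are easy to check numerically for a particular member of the family. The sketch below uses the Poisson density, with $b(x) = e^x$ and $c(y) = -\ln y!$, as an illustrative choice; the evaluation point and the truncation of the infinite sum at 200 terms are assumptions made for the check.

```python
import math


def poisson_natural_pmf(y, x):
    """p(y | x) = exp{y x - b(x) + c(y)} with b(x) = e^x and c(y) = -ln y!."""
    return math.exp(y * x - math.exp(x) - math.lgamma(y + 1))


x0 = 0.3                                          # arbitrary evaluation point
support = range(200)                              # truncation of the infinite sum
total = sum(poisson_natural_pmf(y, x0) for y in support)
mean = sum(y * poisson_natural_pmf(y, x0) for y in support)
variance = sum(y * y * poisson_natural_pmf(y, x0) for y in support) - mean ** 2

# For b(x) = e^x both b'(x) and b''(x) equal e^x, so the mean and the
# variance should both be close to math.exp(x0), and total close to 1.
```

The same check works for any other member of the family once `b` and `c` are replaced accordingly.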


Proof of (8.8.32)–(8.8.34)

The first relation is simply the statement that $p(y_t \mid x_t)$ integrates to 1. The second
relation is established by differentiating both sides of (8.8.32) with respect to $x_t$ and then
multiplying through by $e^{-b(x_t)}$ (for justification of the differentiation under the integral
sign see Barndorff-Nielsen (1978)). The last relation is obtained by differentiating
(8.8.32) twice with respect to $x_t$ and simplifying. $\square$


Example 8.8.6 The Poisson case


If the observation $Y_t$, given $X_t = x_t$, has a Poisson distribution of the form (8.8.21),
then

$$p(y_t \mid x_t) = \exp\{y_t x_t - e^{x_t} - \ln y_t!\}, \quad y_t = 0, 1, \ldots, \tag{8.8.35}$$

which has the form (8.8.31) with $b(x_t) = e^{x_t}$ and $c(y_t) = -\ln y_t!$. From (8.8.33) we
easily find that $E(Y_t \mid x_t) = b'(x_t) = e^{x_t}$. This parameterization is slightly different
from the one used in Examples 8.8.2 and 8.8.5, where the conditional mean of $Y_t$ given
$x_t$ was $\pi x_t$ and not $e^{x_t}$. For this observation equation, define the family of densities


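$$f(x; \alpha, \lambda) = \exp\{\alpha x - \lambda b(x) + A(\alpha, \lambda)\}, \quad -\infty < x < \infty, \tag{8.8.36}$$

where $\alpha > 0$ and $\lambda > 0$ are parameters and $A(\alpha, \lambda) = -\ln \Gamma(\alpha) + \alpha \ln \lambda$. Now
consider state densities of the form

$$p(x_{t+1} \mid y^{(t)}) = f(x_{t+1}; \alpha_{t+1|t}, \lambda_{t+1|t}), \tag{8.8.37}$$

where $\alpha_{t+1|t}$ and $\lambda_{t+1|t}$ are, for the moment, unspecified functions of $y^{(t)}$. (The subscript
$t+1|t$ on the parameters is a shorthand way to indicate dependence on the conditional
distribution of $X_{t+1}$ given $Y^{(t)}$.) With this specification of the state densities, the
parameters $\alpha_{t+1|t}$ are related to the best one-step predictor of $Y_t$ through the formula

$$\alpha_{t+1|t} / \lambda_{t+1|t} = \hat Y_{t+1} = E(Y_{t+1} \mid y^{(t)}). \tag{8.8.38}$$

Proof

We have from (8.8.7) and (8.8.33) that

$$\begin{aligned}
E(Y_{t+1} \mid y^{(t)}) &= \sum_{y_{t+1}=0}^{\infty} \int_{-\infty}^{\infty} y_{t+1}\, p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid y^{(t)})\, dx_{t+1} \\
&= \int_{-\infty}^{\infty} b'(x_{t+1})\, p(x_{t+1} \mid y^{(t)})\, dx_{t+1}.
\end{aligned}$$

Addition and subtraction of $\alpha_{t+1|t} / \lambda_{t+1|t}$ then gives

$$\begin{aligned}
E(Y_{t+1} \mid y^{(t)}) &= \int_{-\infty}^{\infty} \Bigl(b'(x_{t+1}) - \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}}\Bigr) p(x_{t+1} \mid y^{(t)})\, dx_{t+1} + \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}} \\
&= \int_{-\infty}^{\infty} -\lambda_{t+1|t}^{-1}\, p'(x_{t+1} \mid y^{(t)})\, dx_{t+1} + \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}} \\
&= \bigl[-\lambda_{t+1|t}^{-1}\, p(x_{t+1} \mid y^{(t)})\bigr]_{x_{t+1}=-\infty}^{x_{t+1}=\infty} + \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}} \\
&= \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}}. \qquad \square
\end{aligned}$$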


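Letting $A_{t|t-1} = A(\alpha_{t|t-1}, \lambda_{t|t-1})$, we can write the posterior density of $X_t$ given
$Y^{(t)}$ as

$$\begin{aligned}
p(x_t \mid y^{(t)}) &= \exp\{y_t x_t - b(x_t) + c(y_t)\} \exp\{\alpha_{t|t-1} x_t - \lambda_{t|t-1} b(x_t) + A_{t|t-1}\} / p(y_t \mid y^{(t-1)}) \\
&= \exp\{\alpha_{t|t} x_t - \lambda_{t|t} b(x_t) + A_{t|t}\} \\
&= f(x_t; \alpha_{t|t}, \lambda_{t|t}),
\end{aligned}$$

where $A_{t|t} = A(\alpha_{t|t}, \lambda_{t|t})$ and we find, by equating coefficients of $x_t$ and $b(x_t)$, that
the coefficients $\lambda_{t|t}$ and $\alpha_{t|t}$ are determined by

$$\lambda_{t|t} = 1 + \lambda_{t|t-1}, \tag{8.8.39}$$
$$\alpha_{t|t} = y_t + \alpha_{t|t-1}. \tag{8.8.40}$$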


The family of prior densities in (8.8.37) is called a conjugate family of priors for
the observation equation (8.8.35), since the resulting posterior densities are again
members of the same family.

As mentioned earlier, the parameters $\alpha_{t|t-1}$ and $\lambda_{t|t-1}$ can be quite arbitrary: Any
nonnegative functions of $y^{(t-1)}$ will lead to a consistent specification of the state
densities. One convenient choice is to link these parameters with the corresponding
parameters of the posterior distribution at time $t - 1$ through the relations

$$\lambda_{t+1|t} = \delta \lambda_{t|t} \ \bigl(= \delta (1 + \lambda_{t|t-1})\bigr), \tag{8.8.41}$$
$$\alpha_{t+1|t} = \delta \alpha_{t|t} \ \bigl(= \delta (y_t + \alpha_{t|t-1})\bigr), \tag{8.8.42}$$

where $0 < \delta < 1$ (see Remark 4 below). Iterating the relation (8.8.41), we see that

$$\begin{aligned}
\lambda_{t+1|t} &= \delta (1 + \lambda_{t|t-1}) = \delta + \delta \lambda_{t|t-1} \\
&= \delta + \delta (\delta + \delta \lambda_{t-1|t-2}) \\
&= \cdots \\
&= \delta + \delta^2 + \cdots + \delta^t + \delta^t \lambda_{1|0} \\
&\to \delta / (1 - \delta)
\end{aligned} \tag{8.8.43}$$

as $t \to \infty$. Similarly,

$$\begin{aligned}
\alpha_{t+1|t} &= \delta y_t + \delta \alpha_{t|t-1} \\
&= \cdots \\
&= \delta y_t + \delta^2 y_{t-1} + \cdots + \delta^t y_1 + \delta^t \alpha_{1|0}.
\end{aligned} \tag{8.8.44}$$

For large $t$, we have the approximations

$$\lambda_{t+1|t} \approx \delta / (1 - \delta) \tag{8.8.45}$$

and

$$\alpha_{t+1|t} \approx \delta \sum_{j=0}^{t-1} \delta^j y_{t-j}, \tag{8.8.46}$$

which are exact if $\lambda_{1|0} = \delta / (1 - \delta)$ and $\alpha_{1|0} = 0$. From (8.8.38) the one-step predictors
are linear and given by

$$\hat Y_{t+1} = \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}} = \frac{\sum_{j=0}^{t-1} \delta^j y_{t-j} + \delta^{t-1} \alpha_{1|0}}{\sum_{j=0}^{t-1} \delta^j + \delta^{t-1} \lambda_{1|0}}. \tag{8.8.47}$$

Replacing the denominator with its limiting value, or starting with $\lambda_{1|0} = \delta / (1 - \delta)$,
we find that $\hat Y_{t+1}$ is the solution of the recursions

$$\hat Y_{t+1} = (1 - \delta) y_t + \delta \hat Y_t, \quad t = 1, 2, \ldots, \tag{8.8.48}$$

with initial condition $\hat Y_1 = (1 - \delta) \delta^{-1} \alpha_{1|0}$. In other words, under the restrictions
of (8.8.41) and (8.8.42), the best one-step predictors can be found by exponential
smoothing.

Remark 1. The preceding analysis for the Poisson-distributed observation equation
holds, almost verbatim, for the general family of exponential densities (8.8.31).
(One only needs to take care in specifying the correct range for $x$ and the allowable
parameter space for $\alpha$ and $\lambda$ in (8.8.37).) The relations (8.8.43)–(8.8.44), as well
as the exponential smoothing formula (8.8.48), continue to hold even in the more
general setting, provided that the parameters $\alpha_{t|t-1}$ and $\lambda_{t|t-1}$ satisfy the relations
(8.8.41)–(8.8.42).
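The equivalence between the parameter recursions (8.8.41)–(8.8.42) and the exponential smoothing recursion (8.8.48) can be confirmed on any short series. The sketch below is illustrative: the function names, the value of $\delta$, and the counts are hypothetical, and the starting values $\alpha_{1|0} = 0$, $\lambda_{1|0} = \delta/(1 - \delta)$ are chosen so that the approximations are exact.

```python
def power_steady_predictors(ys, delta, alpha_10=0.0, lam_10=None):
    """One-step predictors alpha_{t+1|t} / lambda_{t+1|t} under (8.8.41)-(8.8.42)."""
    if lam_10 is None:
        lam_10 = delta / (1.0 - delta)            # makes (8.8.45) exact for every t
    alpha, lam = alpha_10, lam_10
    preds = []
    for y in ys:
        lam = delta * (1.0 + lam)                 # (8.8.41)
        alpha = delta * (y + alpha)               # (8.8.42)
        preds.append(alpha / lam)                 # (8.8.38)
    return preds


def exponential_smoothing(ys, delta, y1_hat):
    """Recursion (8.8.48): Yhat_{t+1} = (1 - delta) y_t + delta Yhat_t."""
    preds, current = [], y1_hat
    for y in ys:
        current = (1.0 - delta) * y + delta * current
        preds.append(current)
    return preds


series = [0, 2, 1, 1, 3, 0, 1]                    # hypothetical counts
delta = 0.8
from_parameters = power_steady_predictors(series, delta)
from_smoothing = exponential_smoothing(series, delta, y1_hat=0.0)
```

Here `y1_hat=0.0` corresponds to the initial condition $\hat Y_1 = (1 - \delta)\delta^{-1}\alpha_{1|0}$ with $\alpha_{1|0} = 0$.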


Remark 2. Equations (8.8.41)–(8.8.42) are equivalent to the assumption that the
prior density of $X_t$ given $y^{(t-1)}$ is proportional to the $\delta$-power of the posterior
distribution of $X_{t-1}$ given $Y^{(t-1)}$, or more succinctly that

$$\begin{aligned}
f(x_t; \alpha_{t|t-1}, \lambda_{t|t-1}) &= f(x_t; \delta \alpha_{t-1|t-1}, \delta \lambda_{t-1|t-1}) \\
&\propto f^{\delta}(x_t; \alpha_{t-1|t-1}, \lambda_{t-1|t-1}).
\end{aligned}$$

This power relationship is sometimes referred to as the power steady model
(Grunwald, Raftery, and Guttorp, 1993, and Smith, 1979).


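Remark 3. The transformed state variables $W_t = e^{X_t}$ have a gamma state density
given by

$$p(w_{t+1} \mid y^{(t)}) = g(w_{t+1}; \alpha_{t+1|t}, \lambda_{t+1|t})$$

(see Problem 8.26). The mean and variance of this conditional density are

$$E(W_{t+1} \mid y^{(t)}) = \alpha_{t+1|t} / \lambda_{t+1|t} \quad \text{and} \quad \mathrm{Var}(W_{t+1} \mid y^{(t)}) = \alpha_{t+1|t} / \lambda_{t+1|t}^2.$$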


Remark 4. If we regard the random walk plus noise model of Example 8.2.1 as
the prototypical state-space model, then from the calculations in Example 8.8.1 with
$G = F = 1$, we have

$$E(X_{t+1} \mid Y^{(t)}) = E(X_t \mid Y^{(t)})$$

and

$$\mathrm{Var}(X_{t+1} \mid Y^{(t)}) = \mathrm{Var}(X_t \mid Y^{(t)}) + Q > \mathrm{Var}(X_t \mid Y^{(t)}).$$

The first of these equations implies that the best estimate of the next state is the same
as the best estimate of the current state, while the second implies that the variance
increases. Under the conditions (8.8.41) and (8.8.42), the same is also true for the
state variables in the above model (see Problem 8.26). This was, in part, the rationale
behind these conditions given in Harvey and Fernandes (1989).


Remark 5. While the calculations work out neatly for the power steady model,
Grunwald, Hyndman, and Hamza (1994) have shown that such processes have
degenerate sample paths for large $t$. In the Poisson example above, they argue that the
observations $Y_t$ converge to 0 as $t \to \infty$ (see Figure 8.12). Although such models
may still be useful in practice for modeling series of moderate length, the efficacy of
using such models for describing long-term behavior is doubtful.


Example 8.8.7 Goals scored by England against Scotland


The time series of the number of goals scored by England against Scotland in soccer
matches played at Hampden Park in Glasgow is graphed in Figure 8.8. The matches
have been played nearly every second year, with interruptions during the war years.
We will treat the data $y_1, \ldots, y_{52}$ as coming from an equally spaced time series model
$\{Y_t\}$. Since the number of goals scored is small (see the frequency histogram in Figure
8.9), a model based on the Poisson distribution might be deemed appropriate. The
observed relative frequencies and those based on a Poisson distribution with mean
equal to $\bar y_{52} = 1.269$ are contained in Table 8.2. The standard chi-squared goodness
of fit test, comparing the observed frequencies with expected frequencies based on
a Poisson model, has a p-value of .02. The lack of fit with a Poisson distribution is
hardly unexpected, since the sample variance (1.652) is much larger than the sample
mean, while the mean and variance of the Poisson distribution are equal. In this case
the data are said to be overdispersed in the sense that there is more variability in
the data than one would expect from a sample of independent Poisson-distributed
variables. Overdispersion can sometimes be explained by serial dependence in the
data.
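The summary figures quoted above can be recovered from the relative frequencies in Table 8.2. The sketch below is an illustration only: it assumes that the match counts are obtained by multiplying each relative frequency by 52 and rounding, and then recomputes the sample mean, the sample variance, and the fitted Poisson probabilities.

```python
import math

# Relative frequencies for 0, 1, ..., 5 goals, as given in Table 8.2.
rel_freq = [0.288, 0.423, 0.154, 0.019, 0.096, 0.019]
n_matches = 52
# Assumption: counts recovered by rounding relative frequency times 52.
match_counts = [round(f * n_matches) for f in rel_freq]

sample_mean = sum(k * c for k, c in enumerate(match_counts)) / n_matches
sample_var = sum(c * (k - sample_mean) ** 2 for k, c in enumerate(match_counts)) / (n_matches - 1)

# Poisson probabilities with mean equal to the sample mean.
poisson_probs = [math.exp(-sample_mean) * sample_mean ** k / math.factorial(k) for k in range(6)]
overdispersion_ratio = sample_var / sample_mean   # equals 1 for a Poisson distribution
```

The ratio of sample variance to sample mean exceeds 1, which is the overdispersion described in the text.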

Dependence in count data can often be revealed by estimating the probabilities of
transition from one state to another. Table 8.3 contains estimates of these probabilities,
computed as the average number of one-step transitions from state $y_t$ to state $y_{t+1}$. If
the data were independent, then in each column the entries should be nearly the same.
This is certainly not the case in Table 8.3. For example, England is very unlikely to
be shut out or score 3 or more goals in the next match after scoring at least 3 goals in
the previous encounter.
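Transition estimates of this kind are simple relative frequencies of consecutive pairs. The following sketch shows one way to compute them, pooling all values of 3 or more into a single state; the function name and the toy series are hypothetical and are not the goal data.

```python
def transition_estimates(ys, cap=3):
    """Estimate P(Y_{t+1} = j | Y_t = i), pooling values >= cap into one state."""
    states = [min(y, cap) for y in ys]
    pair_counts = [[0] * (cap + 1) for _ in range(cap + 1)]
    for i, j in zip(states, states[1:]):
        pair_counts[i][j] += 1
    rows = []
    for row in pair_counts:
        total = sum(row)
        # A state that is never left contributes a row of zeros.
        rows.append([c / total if total else 0.0 for c in row])
    return rows


toy_series = [0, 1, 1, 3, 1, 0, 2, 1, 4, 1, 0, 0, 1]   # hypothetical counts
p_hat = transition_estimates(toy_series)
```

Each row of `p_hat` corresponds to a value of $y_t$ and each column to a value of $y_{t+1}$, so rows with at least one observed transition sum to 1.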

Table 8.2   Relative frequency and fitted Poisson distribution of goals scored
            by England against Scotland.

                        Number of goals
                        0      1      2      3      4      5
Relative frequency      .288   .423   .154   .019   .096   .019
Poisson distribution    .281   .356   .226   .096   .030   .008


Table 8.3   Transition probabilities for the number of goals scored by England
            against Scotland.

                        $y_{t+1}$
$p(y_{t+1} \mid y_t)$          0      1      2      ≥3
$y_t = 0$               .214   .500   .214   .072
$y_t = 1$               .409   .272   .136   .182
$y_t = 2$               .250   .375   .125   .250
$y_t \ge 3$             0      .857   .143   0


Harvey and Fernandes (1989) model the dependence in this data using an
observation-driven model of the type described in Example 8.8.6. Their model assumes a
Poisson observation equation and a log-gamma state equation:

$$p(y_t \mid x_t) = \frac{\exp\{y_t x_t - e^{x_t}\}}{y_t!}, \quad y_t = 0, 1, \ldots,$$

$$p(x_t \mid y^{(t-1)}) = f(x_t; \alpha_{t|t-1}, \lambda_{t|t-1}), \quad -\infty < x_t < \infty,$$

for $t = 1, 2, \ldots$, where $f$ is given by (8.8.36) and $\alpha_{1|0} = 0$, $\lambda_{1|0} = 0$. The power
steady conditions (8.8.41)–(8.8.42) are assumed to hold for $\alpha_{t|t-1}$ and $\lambda_{t|t-1}$. The only
unknown parameter in the model is $\delta$. The log-likelihood function for $\delta$ based on the
conditional distribution of $y_1, \ldots, y_{52}$ given $y_1$ is given by (see (8.8.27))

$$\ell(\delta, y^{(n)}) = \sum_{t=1}^{n-1} \ln p(y_{t+1} \mid y^{(t)}), \tag{8.8.49}$$

where $p(y_{t+1} \mid y^{(t)})$ is the negative binomial density (see Problem 8.25(c))

$$p(y_{t+1} \mid y^{(t)}) = nb\bigl(y_{t+1}; \alpha_{t+1|t}, (1 + \lambda_{t+1|t})^{-1}\bigr),$$

with $\alpha_{t+1|t}$ and $\lambda_{t+1|t}$ as defined in (8.8.44) and (8.8.43). (For the goal data, $y_1 = 0$,
which implies $\alpha_{2|1} = 0$ and hence that $p(y_2 \mid y^{(1)})$ is a degenerate density with unit
mass at $y_2 = 0$. Harvey and Fernandes avoid this complication by conditioning the
likelihood on $y^{(\tau)}$, where $\tau$ is the time of the first nonzero data value.)
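The log-likelihood (8.8.49) can be evaluated by running the recursions (8.8.41)–(8.8.42) alongside the negative binomial predictive density. The sketch below is a hedged illustration: it writes the negative binomial in the form $\Gamma(\alpha + y)/(\Gamma(\alpha)\, y!)\; p^{\alpha} (1 - p)^{y}$ with $p = \lambda/(1 + \lambda)$, so that $1 - p = (1 + \lambda)^{-1}$ and the predictive mean is $\alpha/\lambda$ as in (8.8.38). It starts the sum after the first nonzero observation, and it maximizes over a coarse grid. The counts used are hypothetical, not the goal data, and the grid search stands in for a proper optimizer.

```python
import math


def nb_logpmf(y, alpha, p):
    """ln of Gamma(alpha + y) / (Gamma(alpha) y!) * p**alpha * (1 - p)**y."""
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(p) + y * math.log(1.0 - p))


def log_likelihood(delta, ys):
    """Sum of ln p(y_{t+1} | y^(t)), conditioning on data up to the first nonzero value."""
    start = next(t for t, y in enumerate(ys) if y > 0)   # requires some nonzero count
    alpha, lam = 0.0, 0.0                                # alpha_{1|0} = lambda_{1|0} = 0
    total = 0.0
    for t, y in enumerate(ys):
        if t > start:
            # (alpha, lam) are the prior parameters for the current observation.
            total += nb_logpmf(y, alpha, lam / (1.0 + lam))
        lam = delta * (1.0 + lam)                        # (8.8.41)
        alpha = delta * (y + alpha)                      # (8.8.42)
    return total


toy_counts = [0, 1, 0, 2, 1, 1, 3, 0, 1, 2, 0, 1]        # hypothetical counts
grid = [k / 100 for k in range(5, 100, 5)]
delta_hat = max(grid, key=lambda d: log_likelihood(d, toy_counts))
```

A finer grid or a one-dimensional optimizer could replace the grid search once the likelihood function is in place.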

Maximizing this likelihood with respect to $\delta$, we obtain $\hat\delta = .844$. (Starting the
equations (8.8.43)–(8.8.44) with $\alpha_{1|0} = 0$ and $\lambda_{1|0} = \delta / (1 - \delta)$, we obtain $\hat\delta = .732$.)
With .844 as our estimate of $\delta$, the prediction density of the next observation $Y_{53}$ given
$y^{(52)}$ is $nb\bigl(y_{53}; \alpha_{53|52}, (1 + \lambda_{53|52})^{-1}\bigr)$. The first five values of this distribution are given in
Table 8.4. Under this model, the probability that England will be held scoreless in the
next match is .471. The one-step predictors, $\hat Y_1 = 0, \hat Y_2, \ldots, \hat Y_{52}$ are graphed in Figure
8.10. (This graph can be obtained by using the ITSM option Smooth>Exponential
with $\alpha = 0.154$.)

Figures 8.11 and 8.12 contain two realizations from the fitted model for the goal
data. The general appearance of the first realization is somewhat compatible with the
goal data, while the second realization illustrates the convergence of the sample path
to 0 in accordance with the result of Grunwald et al. (1994).


Table 8.4   Prediction density of $Y_{53}$ given $Y^{(52)}$ for data in Figure 8.8.

Number of goals             0      1      2      3      4      5
$p(y_{53} \mid y^{(52)})$   .472   .326   .138   .046   .013   .004


Figure 8-10 One-step predictors of the goal data.

Figure 8-11 A simulated time series from the fitted model to the goal data.

Figure 8-12 A second simulated time series from the fitted model to the goal data.


Example 8.8.8 The exponential case

Suppose $Y_t$ given $X_t$ has an exponential density with mean $-1/X_t$ ($X_t < 0$). The
observation density is given by

$$p(y_t \mid x_t) = \exp\{y_t x_t + \ln(-x_t)\}, \quad y_t > 0,$$

which has the form (8.8.31) with $b(x) = -\ln(-x)$ and $c(y) = 0$. The state densities
corresponding to the family of conjugate priors (see (8.8.37)) are given by

$$p(x_{t+1} \mid y^{(t)}) = \exp\{\alpha_{t+1|t} x_{t+1} - \lambda_{t+1|t} b(x_{t+1}) + A_{t+1|t}\}, \quad -\infty < x_{t+1} < 0.$$

(Here $p(x_{t+1} \mid y^{(t)})$ is a probability density when $\alpha_{t+1|t} > 0$ and $\lambda_{t+1|t} > -1$.) The
one-step prediction density is

$$\begin{aligned}
p(y_{t+1} \mid y^{(t)}) &= \int_{-\infty}^{0} \exp\{x_{t+1} y_{t+1} + \ln(-x_{t+1}) + \alpha_{t+1|t} x_{t+1} - \lambda_{t+1|t} b(x_{t+1}) + A_{t+1|t}\}\, dx_{t+1} \\
&= (\lambda_{t+1|t} + 1)\, \alpha_{t+1|t}^{\lambda_{t+1|t} + 1}\, (y_{t+1} + \alpha_{t+1|t})^{-\lambda_{t+1|t} - 2}, \quad y_{t+1} > 0
\end{aligned}$$

(see Problem 8.28). While $E(Y_{t+1} \mid y^{(t)}) = \alpha_{t+1|t} / \lambda_{t+1|t}$, the conditional variance is
finite if and only if $\lambda_{t+1|t} > 1$. Under assumptions (8.8.41)–(8.8.42), and starting with
$\lambda_{1|0} = \delta / (1 - \delta)$, the exponential smoothing formula (8.8.48) remains valid.


Problems


8.1. Show that if all the eigenvalues of $F$ are less than 1 in absolute value (or
equivalently that $F^k \to 0$ as $k \to \infty$), the unique stationary solution of equation
(8.1.11) is given by the infinite series

$$X_t = \sum_{j=0}^{\infty} F^j V_{t-j-1}$$

and that the corresponding observation vectors are

$$Y_t = W_t + \sum_{j=0}^{\infty} G F^j V_{t-j-1}.$$

Deduce that $\{(X_t', Y_t')'\}$ is a multivariate stationary process. (Hint: Use a vector
analogue of the argument in Example 2.2.1.)

8.2. In Example 8.2.1, show that $\theta = -1$ if and only if $\sigma_v^2 = 0$, which in turn is
equivalent to the signal $M_t$ being constant.

8.3. Let $F$ be the coefficient of $X_t$ in the state equation (8.3.4) for the causal AR($p$)
process

$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Establish the stability of (8.3.4) by showing that

$$\det(zI - F) = z^p \phi(z^{-1}),$$

and hence that the eigenvalues of $F$ are the reciprocals of the zeros of the
autoregressive polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$.

8.4. By following the argument in Example 8.3.3, find a state-space model for $\{Y_t\}$
when $\{\nabla \nabla_{12} Y_t\}$ is an ARMA($p$, $q$) process.

8.5. For the local linear trend model defined by equations (8.2.6)–(8.2.7), show that
$\nabla^2 Y_t = (1 - B)^2 Y_t$ is a 2-correlated sequence and hence, by Proposition 2.1.1,
is an MA(2) process. Show that this MA(2) process is noninvertible if $\sigma_u^2 = 0$.

8.6. a. For the seasonal model of Example 8.2.2, show that $\nabla_d Y_t = Y_t - Y_{t-d}$ is an
MA(1) process.

b. Show that $\nabla \nabla_d Y_t$ is an MA($d + 1$) process where $\{Y_t\}$ follows the seasonal
model with a local linear trend as described in Example 8.2.3.

8.7. Let $\{Y_t\}$ be the MA(1) process

$$Y_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Show that $\{Y_t\}$ has the state-space representation

$$Y_t = [1 \ \ 0]\, X_t,$$

where $\{X_t\}$ is the unique stationary solution of

$$X_{t+1} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} X_t + \begin{bmatrix} 1 \\ \theta \end{bmatrix} Z_{t+1}.$$

In particular, show that the state vector $X_t$ can be written as

$$X_t = \begin{bmatrix} 1 & \theta \\ \theta & 0 \end{bmatrix} \begin{bmatrix} Z_t \\ Z_{t-1} \end{bmatrix}.$$

8.8. Verify equations (8.3.16)–(8.3.18) for an ARIMA(1,1,1) process.

8.9. Consider the two state-space models

$$X_{t+1,1} = F_1 X_{t1} + V_{t1},$$
$$Y_{t1} = G_1 X_{t1} + W_{t1},$$

and

$$X_{t+1,2} = F_2 X_{t2} + V_{t2},$$
$$Y_{t2} = G_2 X_{t2} + W_{t2},$$

where $\{(V_{t1}', W_{t1}', V_{t2}', W_{t2}')'\}$ is white noise. Derive a state-space representation
for $\{(Y_{t1}', Y_{t2}')'\}$.

8.10. Use Remark 1 of Section 8.4 to establish the linearity properties of the operator
$P_t$ stated in Remark 3.

8.11. a. Show that if the matrix equation $XS = B$ can be solved for $X$, then $X = BS^{-1}$
is a solution for any generalized inverse $S^{-1}$ of $S$.

b. Use the result of (a) to derive the expression for $P(X \mid Y)$ in Remark 4 of
Section 8.4.

8.12. In the notation of the Kalman prediction equations, show that every vector of
the form

$$Y = A_1 X_1 + \cdots + A_t X_t$$

can be expressed as

$$Y = B_1 X_1 + \cdots + B_{t-1} X_{t-1} + C_t I_t,$$

where $B_1, \ldots, B_{t-1}$ and $C_t$ are matrices that depend on the matrices $A_1, \ldots, A_t$.
Show also that the converse is true. Use these results and the fact that $E(X_s I_t) = 0$
for all $s < t$ to establish (8.4.3).

8.13. In Example 8.4.1, verify that the steady-state solution of the Kalman recursions
(8.1.2) is given by $\Omega = \bigl(\sigma_v^2 + \sqrt{\sigma_v^4 + 4 \sigma_v^2 \sigma_w^2}\bigr) / 2$.

8.14. Show from the difference equations for $\Omega_t$ in Example 8.4.1 that
$(\Omega_{t+1} - \Omega)(\Omega_t - \Omega) \ge 0$ for all $\Omega_t \ge 0$, where $\Omega$ is the steady-state solution for $\Omega_t$ given in
Problem 8.13.

8.15. Show directly that for the MA(1) model (8.2.3), the parameter $\theta$ is equal to
$-\bigl(2\sigma_w^2 + \sigma_v^2 - \sqrt{\sigma_v^4 + 4 \sigma_v^2 \sigma_w^2}\bigr) / (2 \sigma_w^2)$, which in turn is equal to
$-\sigma_w^2 / (\Omega + \sigma_w^2)$, where $\Omega$ is the steady-state solution for $\Omega_t$ given in Problem 8.13.

8.16. Use the ARMA(0,1,1) representation of the series $\{Y_t\}$ in Example 8.4.1 to show
that the predictors defined by

$$\hat Y_{n+1} = a Y_n + (1 - a) \hat Y_n, \quad n = 1, 2, \ldots,$$

where $a = \Omega / (\Omega + \sigma_w^2)$, satisfy

$$Y_{n+1} - \hat Y_{n+1} = Z_{n+1} + (1 - a)^n (Y_1 - Z_1 - \hat Y_1).$$

Deduce that if $0 < a < 1$, the mean squared error of $\hat Y_{n+1}$ converges to $\Omega + \sigma_w^2$
for any initial predictor $\hat Y_1$ with finite mean squared error.

8.17. a. Using equations (8.4.1) and (8.4.10), show that $\hat X_{t+1} = F_t X_{t|t}$.

b. From (a) and (8.4.10) show that $X_{t|t}$ satisfies the recursions

$$X_{t|t} = F_{t-1} X_{t-1|t-1} + \Omega_t G_t' \Delta_t^{-1} (Y_t - G_t F_{t-1} X_{t-1|t-1})$$

for $t = 2, 3, \ldots$, with $X_{1|1} = \hat X_1 + \Omega_1 G_1' \Delta_1^{-1} (Y_1 - G_1 \hat X_1)$.

8.18. In Section 8.5, show that for fixed $Q^*$, $-2 \ln L(\mu, Q^*, \sigma_w^2)$ is minimized when
$\mu$ and $\sigma_w^2$ are given by (8.5.10) and (8.5.11), respectively.

8.19. Verify the calculation of $\Theta_1 \Delta_1^{-1}$ and $\Omega_1$ in Example 8.6.1.

8.20. Verify that the best estimates of missing values in an AR($p$) process are found
by minimizing (8.6.11) with respect to the missing values.

8.21. Suppose that $\{Y_t\}$ is the AR(2) process

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$
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8.22. 


8.23. 


8.24. 


8.25. 


8.26. 


8.27. 


and that we observe Y1, Y2, Y4, Ys, Ye, Y7. Show that the best estimator of Y3 is 


(HY + Ys) + (bi — pio) a + Ys) /(1 +o; +45). 


8.22. Let $X_t$ be the state at time $t$ of a parameter-driven model (see (8.8.2)). Show that $\{X_t\}$ is a Markov chain and that (8.8.3) holds.

8.23. For the generalized state-space model of Example 8.8.1, show that $\Omega_{t+1} = F^2 \Omega_{t|t} + Q$.

8.24. If $Y$ and $X$ are random variables, show that

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X)) + \mathrm{Var}(E(Y|X)).$$

8.25. Suppose that $Y$ and $X$ are two random variables such that the distribution of $Y$ given $X$ is Poisson with mean $\pi X$, $0 < \pi < 1$, and $X$ has the gamma density $g(x; \alpha, \lambda)$.

a. Show that the posterior distribution of $X$ given $Y$ also has a gamma density and determine its parameters.

b. Compute $E(X|Y)$ and $\mathrm{Var}(X|Y)$.

c. Show that $Y$ has a negative binomial density and determine its parameters.

d. Use (c) to compute $E(Y)$ and $\mathrm{Var}(Y)$.

e. Verify in Example 8.8.2 that $E(Y_{t+1}|Y^{(t)}) = \alpha_t\pi/(\lambda_{t+1} - \pi)$ and $\mathrm{Var}(Y_{t+1}|Y^{(t)}) = \alpha_t\pi\lambda_{t+1}/(\lambda_{t+1} - \pi)^2$.

8.26. For the model of Example 8.8.6, show that

a. $E(X_{t+1}|Y^{(t)}) = E(X_t|Y^{(t)})$, $\mathrm{Var}(X_{t+1}|Y^{(t)}) > \mathrm{Var}(X_t|Y^{(t)})$, and

b. the transformed sequence $W_t = e^{X_t}$ has a gamma state density.

8.27. Let $\{V_t\}$ be a sequence of independent exponential random variables with $EV_t = t^{-1}$ and suppose that $\{X_t, t \ge 1\}$ and $\{Y_t, t \ge 1\}$ are the state and observation random variables, respectively, of the parameter-driven state-space system

$$X_1 = V_1,$$
$$X_t = X_{t-1} + V_t, \quad t = 2, 3, \ldots,$$

where the distribution of the observation $Y_t$, conditional on the random variables $Y_1, Y_2, \ldots, Y_{t-1}, X_t$, is Poisson with mean $X_t$.

a. Determine the observation and state transition density functions $p(y_t|x_t)$ and $p(x_{t+1}|x_t)$ in the parameter-driven model for $\{Y_t\}$.

b. Show, using (8.8.4)–(8.8.6), that

$$p(x_1|y_1) = g(x_1; y_1 + 1, 2)$$

and

$$p(x_2|y_1) = g(x_2; y_1 + 2, 2),$$

where $g(x; \alpha, \lambda)$ is the gamma density function (see Example (d) of Section A.1).

c. Show that

$$p(x_t|y^{(t)}) = g(x_t; \alpha_t + t, t + 1)$$

and

$$p(x_{t+1}|y^{(t)}) = g(x_{t+1}; \alpha_t + t + 1, t + 1),$$

where $\alpha_t = y_1 + \cdots + y_t$.

d. Conclude from (c) that the minimum mean squared error estimates of $X_t$ and $X_{t+1}$ based on $Y_1, \ldots, Y_t$ are

$$X_{t|t} = \frac{t + Y_1 + \cdots + Y_t}{t + 1}$$

and

$$\hat X_{t+1} = \frac{t + 1 + Y_1 + \cdots + Y_t}{t + 1},$$

respectively.


8.28. Let $Y$ and $X$ be two random variables such that $Y$ given $X$ is exponential with mean $1/X$, and $X$ has the gamma density function

$$g(x; \lambda + 1, \alpha) = \frac{\alpha^{\lambda+1} x^{\lambda} \exp\{-\alpha x\}}{\Gamma(\lambda + 1)}, \quad x > 0,$$

where $\lambda > -1$ and $\alpha > 0$.

a. Determine the posterior distribution of $X$ given $Y$.

b. Show that $Y$ has a Pareto distribution

$$p(y) = (\lambda + 1)\alpha^{\lambda+1}(y + \alpha)^{-\lambda-2}, \quad y > 0.$$

c. Find the mean and variance of $Y$. Under what conditions on $\alpha$ and $\lambda$ does the latter exist?

d. Verify the calculation of $p(y_{t+1}|y^{(t)})$ and $E(Y_{t+1}|y^{(t)})$ for the model in Example 8.8.8.


8.29. Consider an observation-driven model in which $Y_t$ given $X_t$ is binomial with parameters $n$ and $X_t$, i.e.,

$$p(y_t|x_t) = \binom{n}{y_t} x_t^{y_t} (1 - x_t)^{n - y_t}, \quad y_t = 0, 1, \ldots, n.$$

a. Show that the observation equation with state variable transformed by the logit transformation $W_t = \ln(X_t/(1 - X_t))$ follows an exponential family

$$p(y_t|w_t) = \exp\{y_t w_t - b(w_t) + c(y_t)\}.$$

Determine the functions $b(\cdot)$ and $c(\cdot)$.

b. Suppose that the state $X_t$ has the beta density

$$p(x_{t+1}|y^{(t)}) = f(x_{t+1}; \alpha_{t+1|t}, \lambda_{t+1|t}),$$

where

$$f(x; \alpha, \lambda) = [B(\alpha, \lambda)]^{-1} x^{\alpha-1} (1 - x)^{\lambda-1}, \quad 0 < x < 1,$$

$B(\alpha, \lambda) := \Gamma(\alpha)\Gamma(\lambda)/\Gamma(\alpha + \lambda)$ is the beta function, and $\alpha, \lambda > 0$. Show that the posterior distribution of $X_t$ given $Y_t$ is also beta and express its parameters in terms of $y_t$ and $\alpha_{t|t-1}$, $\lambda_{t|t-1}$.

c. Under the assumptions made in (b), show that $E(X_t|Y^{(t)}) = E(X_{t+1}|Y^{(t)})$ and $\mathrm{Var}(X_t|Y^{(t)}) < \mathrm{Var}(X_{t+1}|Y^{(t)})$.

d. Assuming that the parameters in (b) satisfy (8.8.41)–(8.8.42), show that the one-step prediction density $p(y_{t+1}|y^{(t)})$ is beta-binomial,

$$p(y_{t+1}|y^{(t)}) = \frac{B(\alpha_{t+1|t} + y_{t+1},\; \lambda_{t+1|t} + n - y_{t+1})}{(n + 1)\, B(y_{t+1} + 1,\; n - y_{t+1} + 1)\, B(\alpha_{t+1|t},\; \lambda_{t+1|t})},$$


and verify that Ê, is given by (8.8.47). 


Forecasting Techniques 


9.1 The ARAR Algorithm

9.2 The Holt-Winters Algorithm 

9.3 The Holt-Winters Seasonal Algorithm 
9.4 Choosing a Forecasting Algorithm 


We have focused until now on the construction of time series models for stationary 
and nonstationary series and the determination, assuming the appropriateness of these 
models, of minimum mean squared error predictors. If the observed series had in 
fact been generated by the fitted model, this procedure would give minimum mean 
squared error forecasts. In this chapter we discuss three forecasting techniques that 
have less emphasis on the explicit construction of a model for the data. Each of the 
three selects, from a limited class of algorithms, the one that is optimal according to 
specified criteria. 

The three techniques have been found in practice to be effective on wide ranges 
of real data sets (for example, the economic time series used in the forecasting com- 
petition described by Makridakis et al., 1984). 

The ARAR algorithm described in Section 9.1 is an adaptation of the ARARMA 
algorithm (Newton and Parzen, 1984; Parzen, 1982) in which the idea is to apply 
automatically selected “memory-shortening” transformations (if necessary) to the 
data and then to fit an ARMA model to the transformed series. The ARAR algorithm 
we describe is a version of this in which the ARMA fitting step is replaced by the 
fitting of a subset AR model to the transformed data. 

The Holt—Winters (HW) algorithm described in Section 9.2 uses a set of simple 
recursions that generalize the exponential smoothing recursions of Section 1.5.1 to 
generate forecasts of series containing a locally linear trend.

The Holt-Winters seasonal (HWS) algorithm extends the HW algorithm to handle
data in which there are both trend and seasonal variation of known period. It is 
described in Section 9.3. 

The algorithms can be applied to specific data sets with the aid of the ITSM options
Forecasting>ARAR, Forecasting>Holt-Winters and Forecasting>Seasonal Holt-Winters.


9.1 The ARAR Algorithm


9.1.1 Memory Shortening 


Given a data set $\{Y_t, t = 1, 2, \ldots, n\}$, the first step is to decide whether the underlying process is “long-memory,” and if so to apply a memory-shortening transformation before attempting to fit an autoregressive model. The differencing operations permitted under the option Transform of ITSM are examples of memory-shortening transformations; however, the ones from which the option Forecasting>ARAR selects are members of a more general class. There are two types allowed:

$$\tilde Y_t = Y_t - \hat\phi(\hat\tau)\, Y_{t-\hat\tau} \tag{9.1.1}$$

and

$$\tilde Y_t = Y_t - \hat\phi_1 Y_{t-1} - \hat\phi_2 Y_{t-2}. \tag{9.1.2}$$

With the aid of the five-step algorithm described below, we classify $\{Y_t\}$ and take one of the following three courses of action:

• L. Declare $\{Y_t\}$ to be long-memory and form $\{\tilde Y_t\}$ using (9.1.1).
• M. Declare $\{Y_t\}$ to be moderately long-memory and form $\{\tilde Y_t\}$ using (9.1.2).
• S. Declare $\{Y_t\}$ to be short-memory.

If the alternative L or M is chosen, then the transformed series $\{\tilde Y_t\}$ is again checked. If it is found to be long-memory or moderately long-memory, then a further transformation is performed. The process continues until the transformed series is classified as short-memory. At most three memory-shortening transformations are performed, but it is very rare to require more than two. The algorithm for deciding among L, M, and S can be described as follows:


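1. For each $\tau = 1, 2, \ldots, 15$, we find the value $\hat\phi(\tau)$ of $\phi$ that minimizes

$$\mathrm{ERR}(\phi, \tau) = \frac{\sum_{t=\tau+1}^{n} [Y_t - \phi Y_{t-\tau}]^2}{\sum_{t=\tau+1}^{n} Y_t^2}.$$

We then define

$$\mathrm{Err}(\tau) = \mathrm{ERR}(\hat\phi(\tau), \tau)$$

and choose the lag $\hat\tau$ to be the value of $\tau$ that minimizes $\mathrm{Err}(\tau)$.

2. If $\mathrm{Err}(\hat\tau) \le 8/n$, go to L.

3. If $\hat\phi(\hat\tau) \ge .93$ and $\hat\tau > 2$, go to L.

4. If $\hat\phi(\hat\tau) \ge .93$ and $\hat\tau = 1$ or 2, determine the values $\hat\phi_1$ and $\hat\phi_2$ of $\phi_1$ and $\phi_2$ that minimize $\sum_{t=3}^{n} [Y_t - \phi_1 Y_{t-1} - \phi_2 Y_{t-2}]^2$; then go to M.

5. If $\hat\phi(\hat\tau) < .93$, go to S.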
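The five-step classification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `best_lag_fit` and `classify_memory` are hypothetical, NumPy is assumed to be available, and the comparisons use the thresholds 8/n and .93 as read from the steps above. It is not the ITSM implementation.

```python
import numpy as np

def best_lag_fit(y, max_lag=15):
    """Step 1: for each lag tau, the least-squares phi and the ratio Err(tau)."""
    y = np.asarray(y, dtype=float)
    results = []
    for tau in range(1, max_lag + 1):
        head, tail = y[tau:], y[:-tau]      # Y_t and Y_{t-tau}, t = tau+1, ..., n
        phi = head @ tail / (tail @ tail)   # minimizer of sum (Y_t - phi*Y_{t-tau})^2
        err = np.sum((head - phi * tail) ** 2) / np.sum(head ** 2)
        results.append((err, tau, phi))
    return min(results)                     # tuple with the smallest Err(tau)

def classify_memory(y, threshold=0.93):
    """Steps 2-5: return ('L', (tau, phi)), ('M', (phi1, phi2)) or ('S', None)."""
    y = np.asarray(y, dtype=float)
    err, tau_hat, phi_hat = best_lag_fit(y)
    if err <= 8.0 / len(y) or (phi_hat >= threshold and tau_hat > 2):
        return "L", (tau_hat, phi_hat)
    if phi_hat >= threshold:
        # Least-squares fit of Y_t on (Y_{t-1}, Y_{t-2}), t = 3, ..., n
        design = np.column_stack([y[1:-1], y[:-2]])
        coef, *_ = np.linalg.lstsq(design, y[2:], rcond=None)
        return "M", (coef[0], coef[1])
    return "S", None
```

For a series that repeats exactly with period 12, the sketch selects lag 12 with a coefficient close to 1 and returns L; for white noise it returns S. In the full algorithm the classification would be reapplied to the transformed series until S is returned.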


9.1.2 Fitting a Subset Autoregression 


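Let $\{S_t, t = k+1, \ldots, n\}$ denote the memory-shortened series derived from $\{Y_t\}$ by the algorithm of the previous section and let $\bar S$ denote the sample mean of $S_{k+1}, \ldots, S_n$.

The next step in the modeling procedure is to fit an autoregressive process to the mean-corrected series

$$X_t = S_t - \bar S, \quad t = k+1, \ldots, n.$$

The fitted model has the form

$$X_t = \phi_1 X_{t-1} + \phi_{l_1} X_{t-l_1} + \phi_{l_2} X_{t-l_2} + \phi_{l_3} X_{t-l_3} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, and for given lags $l_1$, $l_2$, and $l_3$, the coefficients $\phi_j$ and the white noise variance $\sigma^2$ are found from the Yule-Walker equations

$$\begin{bmatrix} 1 & \hat\rho(l_1 - 1) & \hat\rho(l_2 - 1) & \hat\rho(l_3 - 1) \\ \hat\rho(l_1 - 1) & 1 & \hat\rho(l_2 - l_1) & \hat\rho(l_3 - l_1) \\ \hat\rho(l_2 - 1) & \hat\rho(l_2 - l_1) & 1 & \hat\rho(l_3 - l_2) \\ \hat\rho(l_3 - 1) & \hat\rho(l_3 - l_1) & \hat\rho(l_3 - l_2) & 1 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_{l_1} \\ \phi_{l_2} \\ \phi_{l_3} \end{bmatrix} = \begin{bmatrix} \hat\rho(1) \\ \hat\rho(l_1) \\ \hat\rho(l_2) \\ \hat\rho(l_3) \end{bmatrix}$$

and

$$\sigma^2 = \hat\gamma(0)\left[1 - \phi_1\hat\rho(1) - \phi_{l_1}\hat\rho(l_1) - \phi_{l_2}\hat\rho(l_2) - \phi_{l_3}\hat\rho(l_3)\right],$$

where $\hat\gamma(j)$ and $\hat\rho(j)$, $j = 0, 1, 2, \ldots$, are the sample autocovariances and autocorrelations of the series $\{X_t\}$.

The program computes the coefficients $\phi_j$ for each set of lags such that

$$1 < l_1 < l_2 < l_3 \le m,$$

where $m$ can be chosen to be either 13 or 26. It then selects the model for which the Yule-Walker estimate $\sigma^2$ is minimal and prints out the lags, coefficients, and white noise variance for the fitted model.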

A slower procedure chooses the lags and coefficients (computed from the Yule— 
Walker equations as above) that maximize the Gaussian likelihood of the observations. 
For this option the maximum lag m is 13. 
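The minimum-variance search over lag triples can be sketched as follows. The helper names `sample_acvf` and `fit_subset_ar` are hypothetical, the sample autocovariances are assumed to use the divisor n, and only the "minimize the Yule-Walker variance" criterion is shown (not the Gaussian likelihood option).

```python
from itertools import combinations
import numpy as np

def sample_acvf(x, max_lag):
    """Sample autocovariances gamma_hat(0..max_lag) of the mean-corrected series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    return np.array([x[: n - h] @ x[h:] / n for h in range(max_lag + 1)])

def fit_subset_ar(x, m=13):
    """Search all lags 1 < l1 < l2 < l3 <= m; return (lags, phi, sigma2)."""
    gamma = sample_acvf(x, m)
    rho = gamma / gamma[0]
    best = None
    for l1, l2, l3 in combinations(range(2, m + 1), 3):
        lags = np.array([1, l1, l2, l3])
        # Yule-Walker matrix with entries rho_hat(|l_i - l_j|)
        R = rho[np.abs(lags[:, None] - lags[None, :])]
        r = rho[lags]
        phi = np.linalg.solve(R, r)
        sigma2 = gamma[0] * (1.0 - phi @ r)
        if best is None or sigma2 < best[2]:
            best = (tuple(int(l) for l in lags), phi, sigma2)
    return best
```

With m = 13 the loop visits 220 lag triples and with m = 26 it visits 2300, so an exhaustive search is inexpensive.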

The options are displayed in the ARAR Forecasting dialog box, which appears 
on the screen when the option Forecasting>ARAR is selected. It allows you also to 
bypass memory shortening and fit a subset AR to the original (mean-corrected) data.

9.1.3 Forecasting

If the memory-shortening filter found in the first step has coefficients $\psi_0 (= 1), \psi_1, \ldots, \psi_k$ $(k \ge 0)$, then the memory-shortened series can be expressed as

$$S_t = \psi(B)Y_t = Y_t + \psi_1 Y_{t-1} + \cdots + \psi_k Y_{t-k}, \tag{9.1.3}$$

where $\psi(B)$ is the polynomial in the backward shift operator,

$$\psi(B) = 1 + \psi_1 B + \cdots + \psi_k B^k.$$

Similarly, if the coefficients of the subset autoregression found in the second step are $\phi_1$, $\phi_{l_1}$, $\phi_{l_2}$, and $\phi_{l_3}$, then the subset AR model for the mean-corrected series $\{X_t = S_t - \bar S\}$ is

$$\phi(B)X_t = Z_t, \tag{9.1.4}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and

$$\phi(B) = 1 - \phi_1 B - \phi_{l_1} B^{l_1} - \phi_{l_2} B^{l_2} - \phi_{l_3} B^{l_3}.$$

From (9.1.3) and (9.1.4) we obtain the equations

$$\xi(B)Y_t = \phi(1)\bar S + Z_t, \tag{9.1.5}$$

where

$$\xi(B) = \psi(B)\phi(B) = 1 + \xi_1 B + \cdots + \xi_{k+l_3} B^{k+l_3}.$$

Assuming that the fitted model (9.1.5) is appropriate and that the white noise term $Z_t$ is uncorrelated with $\{Y_j, j < t\}$ for each $t$, we can determine the minimum mean squared error linear predictors $P_n Y_{n+h}$ of $Y_{n+h}$ in terms of $\{1, Y_1, \ldots, Y_n\}$, for $n > k + l_3$, from the recursions

$$P_n Y_{n+h} = -\sum_{j=1}^{k+l_3} \xi_j P_n Y_{n+h-j} + \phi(1)\bar S, \quad h \ge 1, \tag{9.1.6}$$

with the initial conditions

$$P_n Y_{n+h} = Y_{n+h}, \quad \text{for } h \le 0. \tag{9.1.7}$$

The mean squared error of the predictor $P_n Y_{n+h}$ is found to be (Problem 9.1)

$$E\left[(Y_{n+h} - P_n Y_{n+h})^2\right] = \sum_{j=0}^{h-1} \tau_j^2 \sigma^2, \tag{9.1.8}$$

where $\sum_{j=0}^{\infty} \tau_j z^j$ is the Taylor expansion of $1/\xi(z)$ in a neighborhood of $z = 0$. Equivalently the sequence $\{\tau_j\}$ can be found from the recursion

$$\tau_0 = 1, \quad \sum_{j=0}^{n} \tau_j \xi_{n-j} = 0, \quad n = 1, 2, \ldots. \tag{9.1.9}$$
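The forecast recursion (9.1.6)–(9.1.7) and the mean squared errors (9.1.8)–(9.1.9) can be sketched as below. The function names and the dictionary representation of the subset AR coefficients are assumptions made for this illustration; the series passed in is assumed to contain more than k + l3 observations.

```python
import numpy as np

def whitening_filter(psi, phi_lags):
    """Return (xi, phi_poly): coefficients of xi(B) = psi(B) * phi(B) and of phi(B).

    psi      -- [1, psi_1, ..., psi_k]
    phi_lags -- dict mapping each AR lag to its coefficient, e.g. {1: 0.5}
    """
    phi_poly = np.zeros(max(phi_lags) + 1)
    phi_poly[0] = 1.0
    for lag, coef in phi_lags.items():
        phi_poly[lag] = -coef
    return np.convolve(psi, phi_poly), phi_poly

def arar_forecast(y, psi, phi_lags, s_bar, h_max):
    """Predictors P_n Y_{n+h}, h = 1..h_max, from recursion (9.1.6)."""
    xi, phi_poly = whitening_filter(psi, phi_lags)
    const = phi_poly.sum() * s_bar            # phi(1) * S-bar
    path = list(np.asarray(y, dtype=float))   # observed values, then forecasts
    for _ in range(h_max):
        path.append(const - sum(xi[j] * path[-j] for j in range(1, len(xi))))
    return np.array(path[len(y):])

def forecast_mse(xi, sigma2, h_max):
    """Mean squared errors (9.1.8) for h = 1..h_max, using tau_j from (9.1.9)."""
    tau = [1.0]
    for n in range(1, h_max):
        tau.append(-sum(tau[j] * xi[n - j] for j in range(n) if n - j < len(xi)))
    return sigma2 * np.cumsum(np.square(tau))
```

As a small check, with no memory shortening (psi = [1]), a single AR coefficient .5 at lag 1, and sample mean 10, the last observation 12 yields forecasts 11 and 10.5, and the τ-weights are 1, .5, .25.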
9.1.4 Application of the ARAR Algorithm

To determine an ARAR model for a given data set $\{Y_t\}$ using ITSM, select Forecasting>ARAR and choose the appropriate options in the resulting dialog box. These include specification of the number of forecasts required, whether or not you wish to include the memory-shortening step, whether you require prediction bounds, and which of the optimality criteria is to be used. Once you have made these selections, click OK, and the forecasts will be plotted with the original data. Right-click on the graph and then Info to see the coefficients $1, \psi_1, \ldots, \psi_k$ of the memory-shortening filter $\psi(B)$, the lags and coefficients of the subset autoregression

$$X_t - \phi_1 X_{t-1} - \phi_{l_1} X_{t-l_1} - \phi_{l_2} X_{t-l_2} - \phi_{l_3} X_{t-l_3} = Z_t,$$

and the coefficients $\xi_j$ of $B^j$ in the overall whitening filter

$$\xi(B) = \left(1 + \psi_1 B + \cdots + \psi_k B^k\right)\left(1 - \phi_1 B - \phi_{l_1} B^{l_1} - \phi_{l_2} B^{l_2} - \phi_{l_3} B^{l_3}\right).$$

The numerical values of the predictors, their root mean squared errors, and the prediction bounds are also printed.

Example 9.1.1

To use the ARAR algorithm to predict 24 values of the accidental deaths data, open the file DEATHS.TSM and proceed as described above. Selecting Minimize WN variance [max lag=26] gives the graph of the data and predictors shown in Figure 9.1. Right-clicking on the graph and then Info, we find that the selected memory-shortening filter is $(1 - .9779B^{12})$. The fitted subset autoregression and the coefficients $\xi_j$ of the overall whitening filter $\xi(B)$ are shown below:

    Optimal lags          1        3       12       13
    Optimal coeffs    .5915    .2093   -.3022    .2970
    WN Variance: .12314E+06
    COEFFICIENTS OF OVERALL WHITENING FILTER:
      1.0000   -.5915    .0000   -.2093    .0000
       .0000    .0000    .0000    .0000    .0000
       .0000    .0000   -.6757    .2814    .0000
       .2047    .0000    .0000    .0000    .0000
       .0000    .0000    .0000    .0000   -.2955
       .2904
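The printed whitening-filter coefficients can be checked by multiplying the two polynomials directly. The sketch below uses NumPy's `convolve` and takes the lag-3 autoregressive coefficient to be .2093, the value implied by the printed filter coefficients at lags 3 and 15; the variable names are illustrative only.

```python
import numpy as np

# Memory-shortening filter psi(B) = 1 - .9779 B^12
psi = np.zeros(13)
psi[0], psi[12] = 1.0, -0.9779

# Subset AR polynomial phi(B) = 1 - phi_1 B - phi_3 B^3 - phi_12 B^12 - phi_13 B^13
phi = np.zeros(14)
phi[0] = 1.0
for lag, coef in {1: 0.5915, 3: 0.2093, 12: -0.3022, 13: 0.2970}.items():
    phi[lag] = -coef

# Overall whitening filter xi(B) = psi(B) * phi(B): 26 coefficients, lags 0..25
xi = np.convolve(psi, phi)
nonzero = {j: round(float(c), 4) for j, c in enumerate(xi) if abs(c) > 1e-12}
```

The nonzero entries fall at lags 0, 1, 3, 12, 13, 15, 24, and 25, matching the positions in the listing, and each agrees with the printed value to about four decimal places.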


In Table 9.1 we compare the predictors of the next six values of the accidental deaths series with the actual observed values. The predicted values obtained from ARAR as described in the example are shown together with the predictors obtained by fitting ARIMA models as described in Chapter 6. The observed root mean squared errors (i.e., $\sqrt{\sum_{h=1}^{6} (Y_{72+h} - P_{72}Y_{72+h})^2/6}$) for the three prediction methods are easily calculated to be 253 for ARAR, 583 for the ARIMA model (6.5.8), and 501 for the ARIMA model (6.5.9). The ARAR algorithm thus performs very well here. Notice that in this particular example the ARAR algorithm effectively fits a causal AR model to the data, but this is not always the case.
Figure 9-1. The data set DEATHS.TSM with 24 values predicted by the ARAR algorithm.

Table 9.1. Predicted and observed values of the accidental deaths series for t = 73, ..., 78.

    t                        73     74     75     76     77     78
    Observed Y_t           7798   7406   8363   8460   9217   9316
    Predicted by ARAR      8168   7196   7982   8284   9144   9465
    Predicted by (6.5.8)   8441   7704   8549   8885   9843  10279
    Predicted by (6.5.9)   8345   7619   8356   8742   9795  10179

9.2 The Holt-Winters Algorithm

9.2.1 The Algorithm

Given observations $Y_1, Y_2, \ldots, Y_n$ from the “trend plus noise” model (1.5.2), the exponential smoothing recursions (1.5.7) allowed us to compute estimates $\hat m_t$ of the trend at times $t = 1, 2, \ldots, n$. If the series is stationary, then $m_t$ is constant and the exponential smoothing forecast of $Y_{n+h}$ based on the observations $Y_1, \ldots, Y_n$ is

$$P_n Y_{n+h} = \hat m_n, \quad h = 1, 2, \ldots. \tag{9.2.1}$$

If the data have a (nonconstant) trend, then a natural generalization of the forecast function (9.2.1) that takes this into account is

$$P_n Y_{n+h} = \hat a_n + \hat b_n h, \quad h = 1, 2, \ldots, \tag{9.2.2}$$

where $\hat a_n$ and $\hat b_n$ can be thought of as estimates of the “level” $a_n$ and “slope” $b_n$ of the trend function at time $n$. Holt (1957) suggested a recursive scheme for computing the quantities $\hat a_n$ and $\hat b_n$ in (9.2.2). Denoting by $\hat Y_{n+1}$ the one-step forecast $P_n Y_{n+1}$, we have from (9.2.2)

$$\hat Y_{n+1} = \hat a_n + \hat b_n.$$

Now, as in exponential smoothing, we suppose that the estimated level at time $n + 1$ is a linear combination of the observed value at time $n + 1$ and the forecast value at time $n + 1$. Thus,

$$\hat a_{n+1} = \alpha Y_{n+1} + (1 - \alpha)(\hat a_n + \hat b_n). \tag{9.2.3}$$

We can then estimate the slope at time $n + 1$ as a linear combination of $\hat a_{n+1} - \hat a_n$ and the estimated slope $\hat b_n$ at time $n$. Thus,

$$\hat b_{n+1} = \beta(\hat a_{n+1} - \hat a_n) + (1 - \beta)\hat b_n. \tag{9.2.4}$$

In order to solve the recursions (9.2.3) and (9.2.4) we need initial conditions. A natural choice is to set

$$\hat a_2 = Y_2 \tag{9.2.5}$$

and

$$\hat b_2 = Y_2 - Y_1. \tag{9.2.6}$$

Then (9.2.3) and (9.2.4) can be solved successively for $\hat a_i$ and $\hat b_i$, $i = 3, \ldots, n$, and the predictors $P_n Y_{n+h}$ found from (9.2.2).

The forecasts depend on the “smoothing parameters” $\alpha$ and $\beta$. These can either be prescribed arbitrarily (with values between 0 and 1) or chosen in a more systematic way to minimize the sum of squares of the one-step errors $\sum_{i=3}^{n} (Y_i - P_{i-1}Y_i)^2$, obtained when the algorithm is applied to the already observed data. Both choices are available in the ITSM option Forecasting>Holt-Winters.
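A minimal sketch of the recursions (9.2.3)–(9.2.4) with the initial conditions (9.2.5)–(9.2.6) is given below. The function names are hypothetical, and the coarse grid search in `optimize_grid` is only one simple way to choose the smoothing parameters; how ITSM performs the optimization is not specified here.

```python
import numpy as np

def holt_winters(y, alpha, beta, h_max):
    """Return (forecasts P_n Y_{n+h} for h = 1..h_max, sum of squared one-step errors)."""
    y = np.asarray(y, dtype=float)
    a, b = y[1], y[1] - y[0]                  # (9.2.5), (9.2.6)
    sse = 0.0
    for obs in y[2:]:                         # Y_3, ..., Y_n
        sse += (obs - (a + b)) ** 2           # one-step error Y_i - P_{i-1} Y_i
        a_new = alpha * obs + (1 - alpha) * (a + b)     # (9.2.3)
        b = beta * (a_new - a) + (1 - beta) * b         # (9.2.4)
        a = a_new
    return a + b * np.arange(1, h_max + 1), sse         # (9.2.2)

def optimize_grid(y, step=0.05):
    """Pick (alpha, beta) on a grid in (0, 1] minimizing the one-step sum of squares."""
    grid = np.linspace(step, 1.0, round(1.0 / step))
    pairs = ((al, be) for al in grid for be in grid)
    return min(pairs, key=lambda p: holt_winters(y, p[0], p[1], 1)[1])
```

For an exactly linear series such as 3, 5, 7, 9, 11 the one-step errors vanish for any choice of the parameters and the forecasts continue the line (13, 15, ...).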

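Before illustrating the use of the Holt-Winters forecasting procedure, we discuss the connection between the recursions (9.2.3)–(9.2.4) and the steady-state solution of the Kalman filtering equations for a local linear trend model. Suppose $\{Y_t\}$ follows the local linear structural model with observation equation

$$Y_t = M_t + W_t$$

and state equation

$$\begin{bmatrix} M_{t+1} \\ B_{t+1} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} M_t \\ B_t \end{bmatrix} + \begin{bmatrix} V_t \\ U_t \end{bmatrix}$$

(see (8.2.4)–(8.2.7)). Now define $\hat a_n$ and $\hat b_n$ to be the filtered estimates of $M_n$ and $B_n$, respectively, i.e.,

$$\hat a_n = M_{n|n} := P_n M_n,$$
$$\hat b_n = B_{n|n} := P_n B_n.$$

Using Problem 8.17 and the Kalman recursion (8.4.10), we find that

$$\begin{bmatrix} \hat a_{n+1} \\ \hat b_{n+1} \end{bmatrix} = \begin{bmatrix} \hat a_n + \hat b_n \\ \hat b_n \end{bmatrix} + \Delta_{n+1}^{-1} \Omega_{n+1} G' \left(Y_{n+1} - \hat a_n - \hat b_n\right), \tag{9.2.7}$$

where $G = [1 \;\; 0]$. Assuming that $\Omega_n = \Omega_1 = [\Omega_{ij}]_{i,j=1}^{2}$ is the steady-state solution of (8.4.2) for this model, then $\Delta_n = \Omega_{11} + \sigma_w^2$ for all $n$, so that (9.2.7) simplifies to the equations

$$\hat a_{n+1} = \hat a_n + \hat b_n + \frac{\Omega_{11}}{\Omega_{11} + \sigma_w^2}\left(Y_{n+1} - \hat a_n - \hat b_n\right) \tag{9.2.8}$$

and

$$\hat b_{n+1} = \hat b_n + \frac{\Omega_{21}}{\Omega_{11} + \sigma_w^2}\left(Y_{n+1} - \hat a_n - \hat b_n\right). \tag{9.2.9}$$

Solving (9.2.8) for $(Y_{n+1} - \hat a_n - \hat b_n)$ and substituting into (9.2.9), we find that

$$\hat a_{n+1} = \alpha Y_{n+1} + (1 - \alpha)(\hat a_n + \hat b_n), \tag{9.2.10}$$

$$\hat b_{n+1} = \beta(\hat a_{n+1} - \hat a_n) + (1 - \beta)\hat b_n \tag{9.2.11}$$

with $\alpha = \Omega_{11}/(\Omega_{11} + \sigma_w^2)$ and $\beta = \Omega_{21}/\Omega_{11}$. These equations coincide with the Holt-Winters recursions (9.2.3)–(9.2.4). Equations relating $\alpha$ and $\beta$ to the variances $\sigma_u^2$, $\sigma_v^2$, and $\sigma_w^2$ can be found in Harvey (1990).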


Example 9.2.1

To predict 24 values of the accidental deaths series using the Holt-Winters algorithm, open the file DEATHS.TSM and select Forecasting>Holt-Winters. In the resulting dialog box specify 24 for the number of predictors and check the box marked Optimize coefficients for automatic selection of the smoothing coefficients $\alpha$ and $\beta$. Click OK, and the forecasts will be plotted with the original data as shown in Figure 9.2. Right-click on the graph and then Info to see the numerical values of the predictors, their root mean squared errors, and the optimal values of $\alpha$ and $\beta$.

Figure 9-2. The data set DEATHS.TSM with 24 values predicted by the nonseasonal Holt-Winters algorithm.

The root mean squared error $\sqrt{\sum_{h=1}^{6} (Y_{72+h} - P_{72}Y_{72+h})^2/6}$ for the nonseasonal Holt-Winters forecasts is found to be 1143. Not surprisingly, since we have not taken seasonality into account, this is a much larger value than for the three sets of forecasts shown in Table 9.1. In the next section we show how to modify the Holt-Winters algorithm to allow for seasonality.

Table 9.2. Predicted and observed values of the accidental deaths series for t = 73, ..., 78 from the (nonseasonal) Holt-Winters algorithm.

    t                   73     74     75     76     77     78
    Observed Y_t      7798   7406   8363   8460   9217   9316
    Predicted by HW   9281   9322   9363   9404   9445   9486

9.2.2 Holt-Winters and ARIMA Forecasting

The one-step forecasts obtained by exponential smoothing with parameter $\alpha$ (defined by (1.5.7) and (9.2.1)) satisfy the relations

$$P_n Y_{n+1} = Y_n - (1 - \alpha)(Y_n - P_{n-1}Y_n), \quad n \ge 2. \tag{9.2.12}$$

But these are the same relations satisfied by the large-sample minimum mean squared error forecasts of the invertible ARIMA(0,1,1) process

$$Y_t = Y_{t-1} + Z_t - (1 - \alpha)Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{9.2.13}$$

Forecasting by exponential smoothing with optimal $\alpha$ can therefore be viewed as fitting a member of the two-parameter family of ARIMA processes (9.2.13) to the data and using the corresponding large-sample forecast recursions initialized by $P_0 Y_1 = Y_1$. In ITSM, the optimal $\alpha$ is found by minimizing the average squared error of the one-step forecasts of the observed data $Y_2, \ldots, Y_n$, and the parameter $\sigma^2$ is estimated by this average squared error. This algorithm could easily be modified to minimize other error measures such as average absolute one-step error and average 12-step squared error.

In the same way it can be shown that Holt-Winters forecasting can be viewed as fitting a member of the three-parameter family of ARIMA processes,

$$(1 - B)^2 Y_t = Z_t - (2 - \alpha - \alpha\beta)Z_{t-1} + (1 - \alpha)Z_{t-2}, \tag{9.2.14}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. The coefficients $\alpha$ and $\beta$ are selected as described after (9.2.6), and the estimate of $\sigma^2$ is the average squared error of the one-step forecasts of $Y_3, \ldots, Y_n$ obtained from the large-sample forecast recursions corresponding to (9.2.14).


9.3 The Holt-Winters Seasonal Algorithm 


9.3.1 The Algorithm 


If the series $Y_1, Y_2, \ldots, Y_n$ contains not only trend, but also seasonality with period $d$ (as in the model (1.5.11)), then a further generalization of the forecast function (9.2.2) that takes this into account is

$$P_n Y_{n+h} = \hat a_n + \hat b_n h + \hat c_{n+h}, \quad h = 1, 2, \ldots, \tag{9.3.1}$$

where $\hat a_n$, $\hat b_n$, and $\hat c_n$ can be thought of as estimates of the “trend level” $a_n$, “trend slope” $b_n$, and “seasonal component” $c_n$ at time $n$. If $k$ is the smallest integer such that $n + h - kd \le n$, then we set

$$\hat c_{n+h} = \hat c_{n+h-kd}, \quad h = 1, 2, \ldots, \tag{9.3.2}$$

while the values of $\hat a_i$, $\hat b_i$, and $\hat c_i$, $i = d+2, \ldots, n$, are found from recursions analogous to (9.2.3) and (9.2.4), namely,

$$\hat a_{n+1} = \alpha\left(Y_{n+1} - \hat c_{n+1-d}\right) + (1 - \alpha)(\hat a_n + \hat b_n), \tag{9.3.3}$$

$$\hat b_{n+1} = \beta(\hat a_{n+1} - \hat a_n) + (1 - \beta)\hat b_n, \tag{9.3.4}$$

and

$$\hat c_{n+1} = \gamma(Y_{n+1} - \hat a_{n+1}) + (1 - \gamma)\hat c_{n+1-d}, \tag{9.3.5}$$

with initial conditions

$$\hat a_{d+1} = Y_{d+1}, \tag{9.3.6}$$

$$\hat b_{d+1} = (Y_{d+1} - Y_1)/d, \tag{9.3.7}$$

and

$$\hat c_i = Y_i - \left(Y_1 + \hat b_{d+1}(i - 1)\right), \quad i = 1, \ldots, d+1. \tag{9.3.8}$$

Then (9.3.3)–(9.3.5) can be solved successively for $\hat a_i$, $\hat b_i$, and $\hat c_i$, $i = d+1, \ldots, n$, and the predictors $P_n Y_{n+h}$ found from (9.3.1).

As in the nonseasonal case of Section 9.2, the forecasts depend on the parameters $\alpha$, $\beta$, and $\gamma$. These can either be prescribed arbitrarily (with values between 0 and 1) or chosen in a more systematic way to minimize the sum of squares of the one-step errors $\sum_{i=d+2}^{n} (Y_i - P_{i-1}Y_i)^2$, obtained when the algorithm is applied to the already observed data. Seasonal Holt-Winters forecasts can be computed by selecting the ITSM option Forecasting>Seasonal Holt-Winters.

Example 9.3.1

As in Example 9.2.1, open the file DEATHS.TSM, but this time select Forecasting>Seasonal Holt-Winters. Specify 24 for the number of predicted values required, 12 for the period of the seasonality, and check the box marked Optimize Coefficients. Click OK, and the graph of the data and predicted values shown in Figure 9.3 will appear. Right-click on the graph and then on Info and you will see the numerical values of the predictors and the optimal values of the coefficients $\alpha$, $\beta$, and $\gamma$ (minimizing the observed one-step average squared error $\sum_{i=14}^{72} (Y_i - P_{i-1}Y_i)^2/59$). Table 9.3 compares the predictors of $Y_{73}, \ldots, Y_{78}$ with the corresponding observed values.

Figure 9-3. The data set DEATHS.TSM with 24 values predicted by the seasonal Holt-Winters algorithm.

Table 9.3. Predicted and observed values of the accidental deaths series for t = 73, ..., 78 from the seasonal Holt-Winters algorithm.

    t                    73     74     75     76     77     78
    Observed Y_t       7798   7406   8363   8460   9217   9316
    Predicted by HWS   8039   7077   7750   7941   8824   9329

The root mean squared error $\sqrt{\sum_{h=1}^{6} (Y_{72+h} - P_{72}Y_{72+h})^2/6}$ for the seasonal Holt-Winters forecasts is found to be 401. This is not as good as the value 253 achieved by the ARAR model for this example but is substantially better than the values achieved by the nonseasonal Holt-Winters algorithm (1143) and the ARIMA models (6.5.8) and (6.5.9) (583 and 501, respectively).
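The seasonal recursions (9.3.3)–(9.3.5), started from the initial conditions (9.3.6)–(9.3.8), can be sketched as follows. The function name and the 0-based indexing are choices made for this illustration, and the parameters are taken as given rather than optimized.

```python
import numpy as np

def holt_winters_seasonal(y, d, alpha, beta, gamma, h_max):
    """Return (forecasts P_n Y_{n+h}, h = 1..h_max, sum of squared one-step errors)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    a = y[d]                                   # a_{d+1} = Y_{d+1}            (9.3.6)
    b = (y[d] - y[0]) / d                      # b_{d+1}                      (9.3.7)
    # c[i-1] holds c_i; initial values for i = 1, ..., d+1                    (9.3.8)
    c = list(y[: d + 1] - (y[0] + b * np.arange(d + 1)))
    sse = 0.0
    for t in range(d + 1, n):                  # y[t] is Y_i with i = t + 1 = d+2, ..., n
        c_old = c[t - d]                       # seasonal estimate one period earlier
        sse += (y[t] - (a + b + c_old)) ** 2   # one-step error Y_i - P_{i-1} Y_i
        a_new = alpha * (y[t] - c_old) + (1 - alpha) * (a + b)   # (9.3.3)
        b = beta * (a_new - a) + (1 - beta) * b                  # (9.3.4)
        a = a_new
        c.append(gamma * (y[t] - a) + (1 - gamma) * c_old)       # (9.3.5)
    forecasts = []
    for h in range(1, h_max + 1):
        k = -(-h // d)                         # smallest k with n + h - k*d <= n
        forecasts.append(a + b * h + c[n - 1 + h - k * d])       # (9.3.1), (9.3.2)
    return np.array(forecasts), sse
```

When the data consist of an exact linear trend plus an exactly repeating seasonal pattern, the one-step errors are zero for any parameter values and the forecasts reproduce the continuation of the pattern.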


9.3.2 Holt-Winters Seasonal and ARIMA Forecasting 


As in Section 9.2.2, the Holt-Winters seasonal recursions with seasonal period $d$ correspond to the large-sample forecast recursions of an ARIMA process, in this case defined by

$$\begin{aligned} (1 - B)(1 - B^d)Y_t = {} & Z_t + \cdots + Z_{t-d+1} + \gamma(1 - \alpha)(Z_{t-d} - Z_{t-d-1}) \\ & - (2 - \alpha - \alpha\beta)(Z_{t-1} + \cdots + Z_{t-d}) \\ & + (1 - \alpha)(Z_{t-2} + \cdots + Z_{t-d-1}), \end{aligned}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Holt-Winters seasonal forecasting with optimal $\alpha$, $\beta$, and $\gamma$ can therefore be viewed as fitting a member of this four-parameter family of ARIMA models and using the corresponding large-sample forecast recursions.


9.4 Choosing a Forecasting Algorithm 


Real data are rarely if ever generated by a simple mathematical model such as an 
ARIMA process. Forecasting methods that are predicated on the assumption of such 
a model are therefore not necessarily the best, even in the mean squared error sense. 
Nor is the measurement of error in terms of mean squared error necessarily always 
the most appropriate one in spite of its mathematical convenience. Even within the 
framework of minimum mean squared-error forecasting, we may ask (for example) 
whether we wish to minimize the one-step, two-step, or twelve-step mean squared 
error. 

The use of more heuristic algorithms such as those discussed in this chapter 
is therefore well worth serious consideration in practical forecasting problems. But 
how do we decide which method to use? A relatively simple solution to this problem, 
given the availability of a substantial historical record, is to choose among competing 
algorithms by comparing the relevant errors when the algorithms are applied to the 
data already observed (e.g., by comparing the mean absolute percentage errors of the 
twelve-step predictors of the historical data if twelve-step prediction is of primary 
concern). 
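As a small illustration of comparing methods on already observed data, the sketch below recomputes the observed root mean squared errors for t = 73, ..., 78 of the accidental deaths series from the values in Tables 9.1–9.3 and ranks the methods. The dictionary labels are just names chosen for this example.

```python
import math

observed = [7798, 7406, 8363, 8460, 9217, 9316]
predictions = {
    "ARAR":          [8168, 7196, 7982, 8284, 9144, 9465],
    "ARIMA (6.5.8)": [8441, 7704, 8549, 8885, 9843, 10279],
    "ARIMA (6.5.9)": [8345, 7619, 8356, 8742, 9795, 10179],
    "HW":            [9281, 9322, 9363, 9404, 9445, 9486],
    "HWS":           [8039, 7077, 7750, 7941, 8824, 9329],
}

def rmse(obs, pred):
    """Observed root mean squared error of a set of forecasts."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

scores = {name: rmse(observed, pred) for name, pred in predictions.items()}
ranking = sorted(scores, key=scores.get)      # best (smallest RMSE) first
for name in ranking:
    print(f"{name:14s} {scores[name]:7.1f}")
```

The same pattern applies to any other error measure (for example, mean absolute percentage error of twelve-step predictors) by replacing `rmse` with the measure of interest.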

It is extremely difficult to make general theoretical statements about the relative merits of the various techniques we have discussed (ARIMA modeling, exponential smoothing, ARAR, and HW methods). For the series DEATHS.TSM we found on the basis of average mean squared error for predicting the series at times 73-78 that the ARAR method was best, followed by the seasonal Holt-Winters algorithm, and then the ARIMA models fitted in Chapter 6. This ordering is by no means universal. For example, if we consider the natural logarithms $\{Y_t\}$ of the first 130 observations in the series WINE.TSM (Figure 1.1) and compare the average mean squared errors of the forecasts of $Y_{131}, \ldots, Y_{142}$, we find (Problem 9.2) that an MA(12) model fitted to the mean-corrected differenced series $\{Y_t - Y_{t-12}\}$ does better than seasonal Holt-Winters (with period 12), which in turn does better than ARAR and (not surprisingly) dramatically better than nonseasonal Holt-Winters. An interesting empirical comparison of these and other methods applied to a variety of economic time series is contained in Makridakis et al. (1998).

The versions of the Holt-Winters algorithms we have discussed in Sections 9.2 and 9.3 are referred to as “additive,” since the seasonal and trend components enter the forecasting function in an additive manner. “Multiplicative” versions of the algorithms can also be constructed to deal directly with processes of the form

$$Y_t = m_t s_t Z_t, \tag{9.4.1}$$

where $m_t$, $s_t$, and $Z_t$ are trend, seasonal, and noise factors, respectively (see, e.g., Makridakis et al., 1983). An alternative approach (provided that $Y_t > 0$ for all $t$) is to apply the linear Holt-Winters algorithms to $\{\ln Y_t\}$ (as in the case of WINE.TSM in the preceding paragraph). Because of the rather general memory shortening permitted by the ARAR algorithm, it gives reasonable results when applied directly to series of the form (9.4.1), even without preliminary transformations. In particular, if we consider the first 132 observations in the series AIRPASS.TSM and apply the ARAR algorithm to predict the last 12 values in the series, we obtain (Problem 9.4) an observed root mean squared error of 18.21. On the other hand, if we use the same data, take logarithms, difference at lag 12, subtract the mean, and then fit an AR(13) model by maximum likelihood using ITSM and use it to predict the last 12 values, we obtain an observed root mean squared error of 21.67. The data and predicted values from the ARAR algorithm are shown in Figure 9.4.

Figure 9-4. The first 132 values of the data set AIRPASS.TSM and predictors of the last 12 values obtained by direct application of the ARAR algorithm.

Problems 


9.1. Establish the formula (9.1.8) for the mean squared error of the h-step forecast 
based on the ARAR algorithm. 


9.2. Let $\{X_1, \ldots, X_{142}\}$ denote the data in the file WINE.TSM and let $\{Y_1, \ldots, Y_{142}\}$ denote their natural logarithms. Denote by $m$ the sample mean of the differenced series $\{Y_t - Y_{t-12}, t = 13, \ldots, 130\}$.

a. Use the program ITSM to find the maximum likelihood MA(12) model for the differenced and mean-corrected series $\{Y_t - Y_{t-12} - m, t = 13, \ldots, 130\}$.

b. Use the model in (a) to compute forecasts of $\{X_{131}, \ldots, X_{142}\}$.

c. Tabulate the forecast errors $\{X_t - P_{130}X_t, t = 131, \ldots, 142\}$.

d. Compute the average squared error for the 12 forecasts.

e. Repeat steps (b), (c), and (d) for the corresponding forecasts obtained by applying the ARAR algorithm to the series $\{X_t, t = 1, \ldots, 130\}$.

f. Repeat steps (b), (c), and (d) for the corresponding forecasts obtained by applying the seasonal Holt-Winters algorithm (with period 12) to the logged data $\{Y_t, t = 1, \ldots, 130\}$. (Open the file WINE.TSM, select Transform>Box-Cox with parameter $\lambda = 0$, then select Forecasting>Seasonal Holt-Winters, and check Apply to original data in the dialog box.)

g. Repeat steps (b), (c), and (d) for the corresponding forecasts obtained by applying the nonseasonal Holt-Winters algorithm to the logged data $\{Y_t, t = 1, \ldots, 130\}$. (The procedure is analogous to that described in part (f).)

h. Compare the average squared errors obtained by the four methods.


9.3. In equations (9.2.10)–(9.2.11), show that $\alpha = \Omega_{11}/(\Omega_{11} + \sigma_w^2)$ and $\beta = \Omega_{21}/\Omega_{11}$.


9.4. Verify the assertions made in the last paragraph of Section 9.4, comparing the 
forecasts of the last 12 values of the series AIRPASS.TSM obtained from the 
ARAR algorithm (with no log transformation) and the corresponding forecasts 
obtained by taking logarithms of the original series, then differencing at lag 12, 
mean-correcting, and fitting an AR(13) model to the transformed series. 


Further Topics 


10.1 Transfer Function Models 
10.2 Intervention Analysis 
10.3 Nonlinear Models

10.4 Continuous-Time Models 
10.5 Long-Memory Models 


In this final chapter we touch on a variety of topics of special interest. In Section 10.1 
we consider transfer function models, designed to exploit for predictive purposes the 
relationship between two time series when one acts as a leading indicator for the other. 
Section 10.2 deals with intervention analysis, which allows for possible changes in 
the mechanism generating a time series, causing it to have different properties over 
different time intervals. In Section 10.3 we introduce the very fast growing area of 
nonlinear time series analysis, and in Section 10.4 we briefly discuss continuous-time 
ARMA processes, which, besides being of interest in their own right, are very useful 
also for modeling irregularly spaced data. In Section 10.5 we discuss fractionally 
integrated ARMA processes, sometimes called “long-memory” processes on account 
of the slow rate of convergence of their autocorrelation functions to zero as the lag 
increases. 


10.1 Transfer Function Models 


In this section we consider the problem of estimating the transfer function of a linear
filter when the output includes added uncorrelated noise. Suppose that $\{X_{t1}\}$ and $\{X_{t2}\}$
are, respectively, the input and output of the transfer function model

$$X_{t2} = \sum_{j=0}^{\infty} \tau_j X_{t-j,1} + N_t, \tag{10.1.1}$$

where $T = \{\tau_j,\ j = 0, 1, \ldots\}$ is a causal time-invariant linear filter and $\{N_t\}$ is a
zero-mean stationary process, uncorrelated with the input process $\{X_{t1}\}$. We further
assume that $\{X_{t1}\}$ is a zero-mean stationary time series. Then the bivariate process
$\{(X_{t1}, X_{t2})'\}$ is also stationary. Multiplying each side of (10.1.1) by $X_{t-k,1}$ and then
taking expectations gives the equation

$$\gamma_{21}(k) = \sum_{j=0}^{\infty} \tau_j \gamma_{11}(k - j). \tag{10.1.2}$$


Equation (10.1.2) simplifies a great deal if the input process happens to be white
noise. For example, if $\{X_{t1}\} \sim \mathrm{WN}(0, \sigma_1^2)$, then we can immediately identify $\tau_k$ from
(10.1.2) as

$$\tau_k = \gamma_{21}(k)/\sigma_1^2. \tag{10.1.3}$$

This observation suggests that "prewhitening" of the input process might simplify the
identification of an appropriate transfer function model and at the same time provide
simple preliminary estimates of the coefficients $\tau_k$.

If $\{X_{t1}\}$ can be represented as an invertible ARMA(p, q) process

$$\phi(B)X_{t1} = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma_Z^2), \tag{10.1.4}$$

then application of the filter $\pi(B) = \phi(B)\theta^{-1}(B)$ to $\{X_{t1}\}$ will produce the whitened
series $\{Z_t\}$. Now applying the operator $\pi(B)$ to each side of (10.1.1) and letting
$Y_t = \pi(B)X_{t2}$, we obtain the relation

$$Y_t = \sum_{j=0}^{\infty} \tau_j Z_{t-j} + N_t',$$

where

$$N_t' = \pi(B)N_t,$$

and $\{N_t'\}$ is a zero-mean stationary process, uncorrelated with $\{Z_t\}$. The same
arguments that led to (10.1.3) therefore yield the equation

$$\tau_j = \rho_{YZ}(j)\,\sigma_Y/\sigma_Z, \tag{10.1.5}$$

where $\rho_{YZ}$ is the cross-correlation function of $\{Y_t\}$ and $\{Z_t\}$, $\sigma_Z^2 = \mathrm{Var}(Z_t)$, and
$\sigma_Y^2 = \mathrm{Var}(Y_t)$.
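The prewhitening argument can be tried out on simulated data. The following is a minimal sketch, not part of the ITSM workflow: the MA(1) input coefficient, the filter weights `tau`, the noise level, and the sample size are all illustrative assumptions, and the true MA coefficient is used in place of a fitted one. It applies $\pi(B)$ to both series and recovers the $\tau_j$ from (10.1.5).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
theta = -0.474                       # assumed MA(1) input: X_t1 = Z_t + theta * Z_{t-1}
tau = np.zeros(10)
tau[3:] = 4.0 * 0.7 ** np.arange(7)  # assumed filter: delay b = 3, geometric decay

z = rng.normal(0.0, 1.0, n)
x1 = z.copy()
x1[1:] += theta * z[:-1]
noise = rng.normal(0.0, 0.5, n)
x2 = np.convolve(x1, tau)[:n] + noise   # X_t2 = sum_j tau_j X_{t-j,1} + N_t

def prewhiten(x, theta):
    """Apply pi(B) = (1 + theta B)^{-1}, i.e. y_t = x_t - theta * y_{t-1}, with y_0 = x_0."""
    y = np.empty(len(x))
    prev = 0.0
    for t, xt in enumerate(x):
        prev = xt - theta * prev
        y[t] = prev
    return y

def cross_corr(y, z, h):
    """Sample correlation between y_{t+h} and z_t for h >= 0."""
    yc, zc = y - y.mean(), z - z.mean()
    m = len(yc)
    return np.sum(yc[h:] * zc[:m - h]) / (m * yc.std() * zc.std())

z_hat = prewhiten(x1, theta)   # whitened input
y_hat = prewhiten(x2, theta)   # output passed through the same filter

# Equation (10.1.5): tau_j = rho_YZ(j) * sigma_Y / sigma_Z
tau_hat = np.array([cross_corr(y_hat, z_hat, h) * y_hat.std() / z_hat.std()
                    for h in range(10)])
print(np.round(tau_hat, 2))
```

With these settings the estimates at lags 0 to 2 should be close to zero and those from lag 3 onward close to the assumed weights, up to sampling error.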

Given the observations $\{(X_{t1}, X_{t2})',\ t = 1, \ldots, n\}$, the results of the previous
paragraph suggest the following procedure for estimating $\{\tau_j\}$ and analyzing the
noise $\{N_t\}$ in the model (10.1.1):

1. Fit an ARMA model to $\{X_{t1}\}$ and file the residuals $(\hat{Z}_1, \ldots, \hat{Z}_n)$ (using the Export
button in ITSM to copy them to the clipboard and then pasting them into the first
column of an Excel file). Let $\hat{\phi}$ and $\hat{\theta}$ denote the maximum likelihood estimates
of the autoregressive and moving-average parameters and let $\hat{\sigma}_Z^2$ be the maximum
likelihood estimate of the variance of $\{Z_t\}$.

2. Apply the operator $\hat{\pi}(B) = \hat{\phi}(B)\hat{\theta}^{-1}(B)$ to $\{X_{t2}\}$ to obtain the series $(\hat{Y}_1, \ldots, \hat{Y}_n)$.
(After fitting the ARMA model as in Step 1 above, highlight the window containing
the graph of $\{X_t\}$ and replace $\{X_t\}$ by $\{Y_t\}$ using the option File>Import.
The residuals are then automatically replaced by the residuals of $\{Y_t\}$ under the
model already fitted to $\{X_t\}$.) Export the new residuals to the clipboard, paste
them into the second column of the Excel file created in Step 1, and save this
as a text file, FNAME.TSM. The file FNAME.TSM then contains the bivariate
series $\{(\hat{Z}_t, \hat{Y}_t)\}$. Let $\hat{\sigma}_Y^2$ denote the sample variance of $\hat{Y}_t$.

3. Compute the sample auto- and cross-correlation functions of $\{\hat{Z}_t\}$ and $\{\hat{Y}_t\}$ by
opening the bivariate project FNAME.TSM in ITSM and clicking on the second
yellow button at the top of the ITSM window. Comparison of $\hat{\rho}_{YZ}(h)$ with the
bounds $\pm 1.96\,n^{-1/2}$ gives a preliminary indication of the lags h at which $\rho_{YZ}(h)$
is significantly different from zero. A more refined check can be carried out by
using Bartlett's formula in Section 7.3.4 for the asymptotic variance of $\hat{\rho}_{YZ}(h)$.
Under the assumptions that $\{\hat{Z}_t\} \sim \mathrm{WN}(0, \hat{\sigma}_Z^2)$ and $\{(\hat{Y}_t, \hat{Z}_t)'\}$ is a stationary
Gaussian process,

$$n\,\mathrm{Var}(\hat{\rho}_{YZ}(h)) \sim 1 - \rho_{YZ}^2(h)\left[1.5 - \sum_{k=-\infty}^{\infty}\left(\rho_{YZ}^2(k) + \rho_{YY}^2(k)/2\right)\right]
+ \sum_{k=-\infty}^{\infty}\left[\rho_{YZ}(h+k)\rho_{YZ}(h-k) - 2\rho_{YZ}(h)\rho_{YZ}(k+h)\rho_{YY}(k)\right].$$

In order to check the hypothesis $H_0$ that $\rho_{YZ}(h) = 0$, $h \notin [a, b]$, where a and b
are integers, we note from Corollary 7.3.1 that under $H_0$,

$$\mathrm{Var}(\hat{\rho}_{YZ}(h)) \sim n^{-1} \quad \text{for } h \notin [a, b].$$

We can therefore check the hypothesis $H_0$ by comparing $\hat{\rho}_{YZ}(h)$, $h \notin [a, b]$, with the
bounds $\pm 1.96\,n^{-1/2}$. Observe that $\rho_{ZY}(h)$ should be zero for $h > 0$ if the model
(10.1.1) is valid.

4. Preliminary estimates of $\tau_h$ for the lags h at which $\hat{\rho}_{YZ}(h)$ is significantly different
from zero are

$$\hat{\tau}_h = \hat{\rho}_{YZ}(h)\,\hat{\sigma}_Y/\hat{\sigma}_Z.$$

For other values of h the preliminary estimates are $\hat{\tau}_h = 0$. The numerical values
of the cross-correlations $\hat{\rho}_{YZ}(h)$ are found by right-clicking on the graphs of the
sample correlations plotted in Step 3 and then on Info. The values of $\hat{\sigma}_Z$ and $\hat{\sigma}_Y$
are found by doing the same with the graphs of the series themselves. Let $m \ge 0$
be the largest value of j such that $\hat{\tau}_j$ is nonzero and let $b \ge 0$ be the smallest such
value. Then b is known as the delay parameter of the filter $\{\hat{\tau}_j\}$. If m is very large
and if the coefficients $\{\hat{\tau}_j\}$ are approximately related by difference equations of
the form

$$\hat{\tau}_j - v_1\hat{\tau}_{j-1} - \cdots - v_p\hat{\tau}_{j-p} = 0, \quad j \ge b + p,$$

then $\hat{T}(B) = \sum_{j=b}^{m} \hat{\tau}_j B^j$ can be represented approximately, using fewer parameters, as

$$\hat{T}(B) = w_0(1 - v_1B - \cdots - v_pB^p)^{-1}B^b.$$

In particular, if $\hat{\tau}_j = 0$, $j < b$, and $\hat{\tau}_j = w_0 v_1^{\,j-b}$, $j \ge b$, then

$$\hat{T}(B) = w_0(1 - v_1B)^{-1}B^b. \tag{10.1.6}$$

Box and Jenkins (1976) recommend choosing $\hat{T}(B)$ to be a ratio of two polynomials.
However, the degrees of the polynomials are often difficult to estimate
from $\{\hat{\tau}_j\}$. The primary objective at this stage is to find a parametric function
that provides an adequate approximation to $T(B)$ without introducing too large
a number of parameters. If $\hat{T}(B)$ is represented as $\hat{T}(B) = B^b w(B)v^{-1}(B) =
B^b(w_0 + w_1B + \cdots + w_qB^q)(1 - v_1B - \cdots - v_pB^p)^{-1}$ with $v(z) \ne 0$ for $|z| \le 1$,
then we define $m = \max(q + b, p)$.


5. The noise sequence $\{N_t,\ t = m + 1, \ldots, n\}$ is estimated as

$$\hat{N}_t = X_{t2} - \hat{T}(B)X_{t1}.$$

(We set $\hat{N}_t = 0$, $t \le m$, in order to compute $\hat{N}_t$, $t > m = \max(b + q, p)$.) The
calculations are done in ITSM by opening the bivariate file containing $\{(X_{t1}, X_{t2})\}$,
selecting Transfer>Specify Model, and entering the preliminary model found
in Step 4. Click on the fourth green button to see a graph of the residuals $\{\hat{N}_t\}$.
These should then be filed as, say, NOISE.TSM.


6. Preliminary identification of a suitable model for the noise sequence is carried
out by fitting a causal invertible ARMA model

$$\phi^{(N)}(B)N_t = \theta^{(N)}(B)W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_W^2), \tag{10.1.7}$$

to the estimated noise $\hat{N}_{m+1}, \ldots, \hat{N}_n$ filed as NOISE.TSM in Step 5.


7. At this stage we have the preliminary model

$$\phi^{(N)}(B)v(B)X_{t2} = B^b\phi^{(N)}(B)w(B)X_{t1} + \theta^{(N)}(B)v(B)W_t,$$

where $\hat{T}(B) = B^b w(B)v^{-1}(B)$ as in step (4). For this model we can compute
$\hat{W}_t(\mathbf{w}, \mathbf{v}, \boldsymbol{\phi}^{(N)}, \boldsymbol{\theta}^{(N)})$, $t > m^* = \max(p_2 + p,\ b + p_2 + q)$, by setting $\hat{W}_t = 0$
for $t \le m^*$ (here $p_2$ is the order of $\phi^{(N)}$). The parameters $\mathbf{w}$, $\mathbf{v}$, $\boldsymbol{\phi}^{(N)}$, and
$\boldsymbol{\theta}^{(N)}$ can then be reestimated (more efficiently) by minimizing the sum of squares

$$\sum_{t=m^*+1}^{n} \hat{W}_t^2(\mathbf{w}, \mathbf{v}, \boldsymbol{\phi}^{(N)}, \boldsymbol{\theta}^{(N)}).$$

(The calculations are performed in ITSM by opening the bivariate project $\{(X_{t1}, X_{t2})\}$,
selecting Transfer>Specify model, entering the preliminary model, and
clicking OK. Then choose Transfer>Estimation, click OK, and the least squares
estimates of the parameters will be computed. Pressing the fourth green button
at the top of the screen will give a graph of the estimated residuals $\hat{W}_t$.)

8. To test for goodness of fit, the estimated residuals $\{\hat{W}_t,\ t > m^*\}$ and $\{\hat{Z}_t,\ t > m^*\}$
should be filed as a bivariate series and the auto- and cross-correlations compared
with the bounds $\pm 1.96/\sqrt{n}$ in order to check the hypothesis that the two series
are uncorrelated white noise sequences. Alternative models can be compared
using the AICC value that is printed with the estimated parameters in Step 7.
It is computed from the exact Gaussian likelihood, which is computed using a
state-space representation of the model, described in TSTM, Section 13.1.
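For the special transfer function (10.1.6), the noise estimate in step (5) reduces to a short recursion. The sketch below is only an illustration of that recursion outside ITSM: the function name `estimated_noise` and the simulated data are hypothetical. It uses $v(B)\hat{N}_t = v(B)X_{t2} - w_0 B^b X_{t1}$ with $\hat{N}_t = 0$ for $t \le m = \max(b, 1)$ (the case $q = 0$, $p = 1$).

```python
import numpy as np

def estimated_noise(x1, x2, w0, v1, b):
    """Noise estimate for T(B) = w0 * B^b / (1 - v1 B).

    Arrays are 0-based, so index t corresponds to time t + 1.
    N_hat is set to 0 for the first m = max(b, 1) time points.
    """
    n = len(x2)
    m = max(b, 1)
    n_hat = np.zeros(n)
    for t in range(m, n):
        n_hat[t] = v1 * n_hat[t - 1] + x2[t] - v1 * x2[t - 1] - w0 * x1[t - b]
    return n_hat

# Hypothetical check on simulated data: output = filtered input + white noise.
rng = np.random.default_rng(2)
n, w0, v1, b = 300, 4.86, 0.698, 3
x1 = rng.normal(size=n)
signal = np.zeros(n)
for t in range(n):
    signal[t] = (v1 * signal[t - 1] if t >= 1 else 0.0) + (w0 * x1[t - b] if t >= b else 0.0)
true_noise = rng.normal(scale=0.25, size=n)
x2 = signal + true_noise

n_hat = estimated_noise(x1, x2, w0, v1, b)
```

Because the error introduced by the zero starting values is multiplied by $v_1$ at every step, the estimate agrees with the simulated noise after a short initial stretch whenever $|v_1| < 1$.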


Example 10.1.1 Sales with a leading indicator


In this example we fit a transfer function model to the bivariate time series of Example
7.1.2. Let

$$X_{t1} = (1 - B)Y_{t1} - .0228, \quad t = 2, \ldots, 150,$$
$$X_{t2} = (1 - B)Y_{t2} - .420, \quad t = 2, \ldots, 150,$$

where $\{Y_{t1}\}$ and $\{Y_{t2}\}$, $t = 1, \ldots, 150$, are the leading indicator and sales data,
respectively. It was found in Example 7.1.2 that $\{X_{t1}\}$ and $\{X_{t2}\}$ can be modeled as
low-order zero-mean ARMA processes. In particular, we fitted the model

$$X_{t1} = (1 - .474B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, .0779),$$

to the series $\{X_{t1}\}$. We can therefore whiten the series by application of the filter
$\hat{\pi}(B) = (1 - .474B)^{-1}$. Applying $\hat{\pi}(B)$ to both $\{X_{t1}\}$ and $\{X_{t2}\}$ we obtain

$$\hat{Z}_t = (1 - .474B)^{-1}X_{t1}, \quad \hat{\sigma}_Z^2 = .0779,$$
$$\hat{Y}_t = (1 - .474B)^{-1}X_{t2}, \quad \hat{\sigma}_Y^2 = 4.0217.$$


These calculations and the filing of the series $\{\hat{Z}_t\}$ and $\{\hat{Y}_t\}$ were carried out
using ITSM as described in steps (1) and (2). Their sample auto- and cross-correlations,
found as described in step (3), are shown in Figure 10.1. The cross-correlations
$\hat{\rho}_{ZY}(h)$ (top right) and $\hat{\rho}_{YZ}(h)$ (bottom left), when compared with the bounds
$\pm 1.96(149)^{-1/2} = \pm .161$, strongly suggest a transfer function model for $\{X_{t2}\}$ in
terms of $\{X_{t1}\}$ with delay parameter 3. Since $\hat{\tau}_j = \hat{\rho}_{YZ}(j)\hat{\sigma}_Y/\hat{\sigma}_Z$ is decreasing
approximately geometrically for $j \ge 3$, we take $T(B)$ to have the form (10.1.6), i.e.,

$$T(B) = w_0(1 - v_1B)^{-1}B^3.$$

The preliminary estimates of $w_0$ and $v_1$ are $\hat{w}_0 = \hat{\tau}_3 = 4.86$ and $\hat{v}_1 = \hat{\tau}_4/\hat{\tau}_3 = .698$, the
coefficients $\hat{\tau}_j$ being estimated as described in step (4). The estimated noise sequence
is determined and filed using ITSM as described in step (5). It satisfies the equations

$$\hat{N}_t = X_{t2} - 4.86B^3(1 - .698B)^{-1}X_{t1}, \quad t = 5, 6, \ldots, 150.$$

Analysis of this univariate series with ITSM gives the MA(1) model

$$N_t = (1 - .364B)W_t, \quad \{W_t\} \sim \mathrm{WN}(0, .0590).$$

Figure 10-1. The sample correlation functions $\hat{\rho}_{ij}(h)$ of Example 10.1.1. Series 1 is $\{\hat{Z}_t\}$
and Series 2 is $\{\hat{Y}_t\}$.


Substituting these preliminary noise and transfer function models into equation
(10.1.1) then gives

$$X_{t2} = 4.86B^3(1 - .698B)^{-1}X_{t1} + (1 - .364B)W_t, \quad \{W_t\} \sim \mathrm{WN}(0, .0590).$$

Now minimizing the sum of squares (10.1.7) with respect to the parameters
$(w_0, v_1, \theta_1^{(N)})$ as described in step (7), we obtain the least squares model

$$X_{t2} = 4.717B^3(1 - .724B)^{-1}X_{t1} + (1 - .582B)W_t, \tag{10.1.8}$$

where $\{W_t\} \sim \mathrm{WN}(0, .0486)$ and

$$X_{t1} = (1 - .474B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, .0779).$$

Notice the reduced white noise variance of $\{W_t\}$ in the least squares model as
compared with the preliminary model.

The sample auto- and cross-correlation functions of the series $\hat{Z}_t$ and $\hat{W}_t$,
$t = 5, \ldots, 150$, are shown in Figure 10.2. All of the correlations lie between the bounds
$\pm 1.96/\sqrt{144}$, supporting the assumption underlying the fitted model that the residuals
are uncorrelated white noise sequences.

Figure 10-2. The sample correlation functions of the estimated residuals from the
model fitted in Example 10.1.1. Series 1 is $\{\hat{Z}_t\}$ and Series 2 is $\{\hat{W}_t\}$.


10.1.1 Prediction Based on a Transfer Function Model


When predicting $X_{n+h,2}$ on the basis of the transfer function model defined by (10.1.1),
(10.1.4), and (10.1.7), with observations of $X_{t1}$ and $X_{t2}$, $t = 1, \ldots, n$, our aim is to
find the linear combination of $1, X_{11}, \ldots, X_{n1}, X_{12}, \ldots, X_{n2}$ that predicts $X_{n+h,2}$ with
minimum mean squared error. The exact solution of this problem can be found with
the help of the Kalman recursions (see TSTM, Section 13.1 for details). The program
ITSM uses these recursions to compute the predictors and their mean squared errors.

In order to provide a little more insight, we give here the predictors $P_nX_{n+h,2}$
and mean squared errors based on infinitely many past observations $X_{t1}$ and $X_{t2}$,
$-\infty < t \le n$. These predictors and their mean squared errors will be close to those
based on $X_{t1}$ and $X_{t2}$, $1 \le t \le n$, if n is sufficiently large.

The transfer function model defined by (10.1.1), (10.1.4), and (10.1.7) can be
rewritten as

$$X_{t2} = T(B)X_{t1} + \beta(B)W_t, \tag{10.1.9}$$
$$X_{t1} = \theta(B)\phi^{-1}(B)Z_t, \tag{10.1.10}$$

where $\beta(B) = \theta^{(N)}(B)/\phi^{(N)}(B)$. Eliminating $X_{t1}$ gives

$$X_{t2} = \sum_{j=0}^{\infty} \alpha_j Z_{t-j} + \sum_{j=0}^{\infty} \beta_j W_{t-j}, \tag{10.1.11}$$

where $\alpha(B) = T(B)\theta(B)/\phi(B)$.

Noting that each limit of linear combinations of $\{X_{t1}, X_{t2},\ -\infty < t \le n\}$ is a
limit of linear combinations of $\{Z_t, W_t,\ -\infty < t \le n\}$ and conversely and that $\{Z_t\}$
and $\{W_t\}$ are uncorrelated, we see at once from (10.1.11) that

$$P_nX_{n+h,2} = \sum_{j=h}^{\infty} \alpha_j Z_{n+h-j} + \sum_{j=h}^{\infty} \beta_j W_{n+h-j}. \tag{10.1.12}$$

Setting $t = n + h$ in (10.1.11) and subtracting (10.1.12) gives the mean squared error

$$E\left(X_{n+h,2} - P_nX_{n+h,2}\right)^2 = \sigma_Z^2\sum_{j=0}^{h-1}\alpha_j^2 + \sigma_W^2\sum_{j=0}^{h-1}\beta_j^2. \tag{10.1.13}$$


To compute the predictors $P_nX_{n+h,2}$ we proceed as follows. Rewrite (10.1.9) as

$$A(B)X_{t2} = B^bU(B)X_{t1} + V(B)W_t, \tag{10.1.14}$$

where A, U, and V are polynomials of the form

$$A(B) = 1 - A_1B - \cdots - A_aB^a,$$
$$U(B) = U_0 + U_1B + \cdots + U_uB^u,$$
$$V(B) = 1 + V_1B + \cdots + V_vB^v.$$

Applying the operator $P_n$ to equation (10.1.14) with $t = n + h$, we obtain

$$P_nX_{n+h,2} = \sum_{j=1}^{a} A_jP_nX_{n+h-j,2} + \sum_{j=0}^{u} U_jP_nX_{n+h-b-j,1} + \sum_{j=h}^{v} V_jW_{n+h-j}, \tag{10.1.15}$$

where the last sum is zero if $h > v$.

Since $\{X_{t1}\}$ is uncorrelated with $\{W_t\}$, the predictors appearing in the second
sum in (10.1.15) are therefore obtained by predicting the univariate series $\{X_{t1}\}$ as
described in Section 3.3 using the model (10.1.10). In keeping with our assumption
that n is large, we can replace $P_nX_{j1}$ for each j by the finite-past predictor obtained
from the program ITSM. The values $W_j$, $j \le n$, are replaced by their estimated values
$\hat{W}_j$ from the least squares estimation in step (7) of the modeling procedure.

Equations (10.1.15) can now be solved recursively for the predictors $P_nX_{n+1,2}$,
$P_nX_{n+2,2}$, $P_nX_{n+3,2}$, $\ldots$.


Sales with a leading indicator 


Applying the preceding results to the series $\{(X_{t1}, X_{t2}),\ 2 \le t \le 150\}$ of Example
10.1.1, and using the values $X_{148,1} = -.093$, $X_{150,2} = .08$, $\hat{W}_{150} = -.0706$, $\hat{W}_{149} =
.1449$, we find from (10.1.8) and (10.1.15) that

$$P_{150}X_{151,2} = .724X_{150,2} + 4.717X_{148,1} - 1.306\hat{W}_{150} + .421\hat{W}_{149} = -.228$$

and, using the value $X_{149,1} = .237$, that

$$P_{150}X_{152,2} = .724P_{150}X_{151,2} + 4.717X_{149,1} + .421\hat{W}_{150} = .923.$$

In terms of the original sales data $\{Y_{t2}\}$ we have $Y_{150,2} = 262.7$ and

$$Y_{t2} = Y_{t-1,2} + X_{t2} + .420.$$

Hence the predictors of actual sales are

$$P^*_{150}Y_{151,2} = 262.70 - .228 + .420 = 262.89,$$
$$P^*_{150}Y_{152,2} = 262.89 + .923 + .420 = 264.23,$$

where $P^*_{150}$ is based on $\{1, Y_{11}, Y_{12}, X_{s1}, X_{s2},\ -\infty < s \le 150\}$, and it is assumed that
$Y_{11}$ and $Y_{12}$ are uncorrelated with $\{X_{t1}\}$ and with $\{X_{t2}\}$. The predicted values are in
close agreement with those based on the finite number of available observations that
are computed by ITSM. Since our model for the sales data is

$$(1 - B)Y_{t2} = .420 + 4.717B^3(1 - .474B)(1 - .724B)^{-1}Z_t + (1 - .582B)W_t,$$

it can be shown, using an argument analogous to that which gave (10.1.13), that the
mean squared errors are given by

$$E\left(Y_{150+h,2} - P^*_{150}Y_{150+h,2}\right)^2 = \sigma_Z^2\sum_{j=0}^{h-1}\alpha_j^{*2} + \sigma_W^2\sum_{j=0}^{h-1}\beta_j^{*2},$$

where

$$\sum_{j=0}^{\infty}\alpha_j^*z^j = 4.717z^3(1 - .474z)(1 - .724z)^{-1}(1 - z)^{-1}$$

and

$$\sum_{j=0}^{\infty}\beta_j^*z^j = (1 - .582z)(1 - z)^{-1}.$$

For h = 1 and 2 we obtain

$$E\left(Y_{151,2} - P^*_{150}Y_{151,2}\right)^2 = .0486,$$
$$E\left(Y_{152,2} - P^*_{150}Y_{152,2}\right)^2 = .0570,$$

in close agreement with the finite-past mean squared errors obtained by ITSM.
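The arithmetic of this example can be checked with a few lines of code. The sketch below is an illustration only and uses nothing beyond the numbers quoted in the text; the helper `series_coeffs` is a hypothetical name for a routine that expands a ratio of polynomials as a power series. It multiplies (10.1.8) through by $(1 - .724B)$, applies the recursion (10.1.15), and then evaluates the mean squared error formula for h = 1 and 2.

```python
import numpy as np

# Values quoted in the example
x150_2, x148_1, x149_1 = 0.08, -0.093, 0.237
w150, w149 = -0.0706, 0.1449
y150 = 262.7
w0, v1, theta_n = 4.717, 0.724, -0.582      # parameters of (10.1.8)
sigma2_z, sigma2_w = 0.0779, 0.0486

# V(B) = (1 - .724B)(1 - .582B) = 1 + V1 B + V2 B^2 (ascending powers)
V = np.convolve([1.0, -v1], [1.0, theta_n])

# Recursion (10.1.15) with a = 1, b = 3, u = 0, v = 2
p151 = v1 * x150_2 + w0 * x148_1 + V[1] * w150 + V[2] * w149
p152 = v1 * p151 + w0 * x149_1 + V[2] * w150

# Predictors of actual sales: Y_t2 = Y_{t-1,2} + X_t2 + .420
y151 = y150 + p151 + 0.420
y152 = y151 + p152 + 0.420

def series_coeffs(num, den, k):
    """First k power-series coefficients of num(z)/den(z).

    num and den list coefficients in ascending powers of z, with den[0] == 1.
    """
    c = np.zeros(k)
    for j in range(k):
        acc = num[j] if j < len(num) else 0.0
        for i in range(1, min(j, len(den) - 1) + 1):
            acc -= den[i] * c[j - i]
        c[j] = acc
    return c

alpha = series_coeffs([0.0, 0.0, 0.0, w0, -w0 * 0.474],
                      np.convolve([1.0, -v1], [1.0, -1.0]), 5)
beta = series_coeffs([1.0, theta_n], [1.0, -1.0], 5)

def mse(h):
    return sigma2_z * np.sum(alpha[:h] ** 2) + sigma2_w * np.sum(beta[:h] ** 2)

print(p151, p152, y151, y152, mse(1), mse(2))
```

The printed values agree with those in the text up to rounding of the intermediate results. Because of the delay of three time units, the first three coefficients $\alpha^*_j$ vanish, so for h = 1 and 2 only the noise term contributes to the mean squared error.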

It is interesting to examine the improvement obtained by using the transfer function
model rather than fitting a univariate model to the sales data alone. If we adopt
the latter course, we obtain the model

$$X_{t2} - .249X_{t-1,2} - .199X_{t-2,2} = U_t,$$

where $\{U_t\} \sim \mathrm{WN}(0, 1.794)$ and $X_{t2} = Y_{t2} - Y_{t-1,2} - .420$. The corresponding
predictors of $Y_{151,2}$ and $Y_{152,2}$ are easily found from the program ITSM to be 263.14
and 263.58 with mean squared errors 1.794 and 4.593, respectively. These mean
squared errors are dramatically worse than those obtained using the transfer function
model.


10.2 Intervention Analysis


During the period for which a time series is observed, it is sometimes the case that a 
change occurs that affects the level of the series. A change in the tax laws may, for 
example, have a continuing effect on the daily closing prices of shares on the stock 
market. In the same way construction of a dam on a river may have a dramatic effect 
on the time series of streamflows below the dam. In the following we shall assume 
that the time T at which the change (or “intervention’’) occurs is known. 

To account for such changes, Box and Tiao (1975) introduced a model for intervention
analysis that has the same form as the transfer function model

$$Y_t = \sum_{j=0}^{\infty} \tau_j X_{t-j} + N_t, \tag{10.2.1}$$

except that the input series $\{X_t\}$ is not a random series but a deterministic function of
t. It is clear from (10.2.1) that $\sum_{j=0}^{\infty} \tau_j X_{t-j}$ is then the mean of $Y_t$. The function $\{X_t\}$
and the coefficients $\{\tau_j\}$ are therefore chosen in such a way that the changing level
of the observations of $\{Y_t\}$ is well represented by the sequence $\sum_{j=0}^{\infty} \tau_j X_{t-j}$. For a
series $\{Y_t\}$ with $EY_t = 0$ for $t < T$ and $EY_t \to 0$ as $t \to \infty$, a suitable input series is

$$X_t = I_t(T) = \begin{cases} 1 & \text{if } t = T, \\ 0 & \text{if } t \ne T. \end{cases} \tag{10.2.2}$$

For a series $\{Y_t\}$ with $EY_t = 0$ for $t < T$ and $EY_t \to a \ne 0$ as $t \to \infty$, a suitable
input series is

$$X_t = H_t(T) = \sum_{k=T}^{\infty} I_t(k) = \begin{cases} 1 & \text{if } t \ge T, \\ 0 & \text{if } t < T. \end{cases} \tag{10.2.3}$$

(Other deterministic input functions $\{X_t\}$ can also be used, for example when
interventions occur at more than one time.) The function $\{X_t\}$ having been selected by
inspection of the data, the determination of the coefficients $\{\tau_j\}$ in (10.2.1) then
reduces to a regression problem in which the errors $\{N_t\}$ constitute an ARMA process.
This problem can be solved using the program ITSM as described below.

The goal of intervention analysis is to estimate the effect of the intervention
as indicated by the term $\sum_{j=0}^{\infty} \tau_j X_{t-j}$ and to use the resulting model (10.2.1) for
forecasting. For example, Wichern and Jones (1978) used intervention analysis to 
investigate the effect of the American Dental Association’s endorsement of Crest 
toothpaste on Crest’s market share. Other applications of intervention analysis can be 
found in Box and Tiao (1975), Atkins (1979), and Bhattacharyya and Layton (1979). 
A more general approach can also be found in West and Harrison (1989), Harvey 
(1990), and Pole, West, and Harrison (1994). 

As in the case of transfer function modeling, once $\{X_t\}$ has been chosen (usually as
either (10.2.2) or (10.2.3)), estimation of the linear filter $\{\tau_j\}$ in (10.2.1) is simplified
by approximating the operator $T(B) = \sum_{j=0}^{\infty} \tau_j B^j$ with a rational operator of the
form

$$T(B) = \frac{B^bW(B)}{V(B)}, \tag{10.2.4}$$

where b is the delay parameter and $W(B)$ and $V(B)$ are polynomials of the form

$$W(B) = w_0 + w_1B + \cdots + w_qB^q$$

and

$$V(B) = 1 - v_1B - \cdots - v_pB^p.$$

By suitable choice of the parameters b, q, p and the coefficients $w_i$ and $v_j$, the
intervention term $T(B)X_t$ can be made to take a great variety of functional forms.

For example, if $T(B) = wB^2/(1 - vB)$ and $X_t = I_t(T)$ as in (10.2.2), the
resulting intervention term is

$$\frac{wB^2}{1 - vB}\,I_t(T) = \sum_{j=0}^{\infty} v^jw\,I_{t-j-2}(T) = \sum_{j=0}^{\infty} v^jw\,I_t(T + 2 + j),$$

a series of pulses of sizes $v^jw$ at times $T + 2 + j$, $j = 0, 1, 2, \ldots$. If $|v| < 1$, the effect
of the intervention is to add a series of pulses with size w at time $T + 2$, decreasing
to zero at a geometric rate depending on v as $t \to \infty$. Similarly, with $X_t = H_t(T)$ as
in (10.2.3),

$$\frac{wB^2}{1 - vB}\,H_t(T) = \sum_{j=0}^{\infty} v^jw\,H_{t-j-2}(T) = \sum_{j=0}^{\infty} (1 + v + \cdots + v^j)\,w\,I_t(T + 2 + j),$$

a series of pulses of sizes $(1 + v + \cdots + v^j)w$ at times $T + 2 + j$, $j = 0, 1, 2, \ldots$.
If $|v| < 1$, the effect of the intervention is to bring about a shift in level of the series
$X_t$, the size of the shift converging to $w/(1 - v)$ as $t \to \infty$.
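The two response shapes just described are easy to generate numerically. The following is a minimal sketch in which the values of w, v, the series length, and the intervention time are arbitrary illustrative choices and `intervention_term` is a hypothetical helper name. It applies $wB^2/(1 - vB)$ to a pulse input and to a step input through the recursion $m_t = v\,m_{t-1} + w\,x_{t-2}$.

```python
import numpy as np

def intervention_term(x, w, v, b=2):
    """Apply w * B^b / (1 - v B) to x via m_t = v * m_{t-1} + w * x_{t-b}, zero start."""
    m = np.zeros(len(x))
    for t in range(len(x)):
        prev = m[t - 1] if t >= 1 else 0.0
        lagged = x[t - b] if t >= b else 0.0
        m[t] = v * prev + w * lagged
    return m

n, T = 60, 10              # T is the (0-based) intervention time, chosen arbitrarily
w, v = -5.0, 0.6           # illustrative values with |v| < 1

pulse = np.zeros(n)
pulse[T] = 1.0             # I_t(T) as in (10.2.2)
step = np.zeros(n)
step[T:] = 1.0             # H_t(T) as in (10.2.3)

pulse_resp = intervention_term(pulse, w, v)   # w, v*w, v^2*w, ... from time T + 2
step_resp = intervention_term(step, w, v)     # partial sums, levelling off at w / (1 - v)

print(pulse_resp[T + 2:T + 6])
print(step_resp[-1], w / (1 - v))
```

The pulse input gives a transient that dies away geometrically, while the step input gives a permanent change of level approaching $w/(1 - v)$, matching the two cases in the text.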

An appropriate form for $X_t$ and possible values of b, q, and p having been chosen
by inspection of the data, the estimation of the parameters in (10.2.4) and the fitting
of the model for $\{N_t\}$ can be carried out using steps (6)-(8) of the transfer function
modeling procedure described in Section 10.1. Start with step (7) and assume that
$\{N_t\}$ is white noise to get preliminary estimates of the coefficients $w_i$ and $v_j$ by least
squares. The residuals are filed and used as estimates of $\{N_t\}$. Then go to step (6) and
continue exactly as for transfer function modeling with input series $\{X_t\}$ and output
series $\{Y_t\}$.


Example 10.2.1 Seat-belt legislation


In this example we reanalyze the seat-belt legislation data, SBL.TSM of Example
6.6.3, from the point of view of intervention analysis. For this purpose the bivariate
series $\{(f_t, Y_t)\}$ consisting of the series filed as SBLIN.TSM and SBL.TSM,
respectively, has been saved in the file SBL2.TSM. The input series $\{f_t\}$ is the deterministic
step-function defined in Example 6.6.3 and $Y_t$ is the number of deaths and serious
injuries on UK roads in month t, $t = 1, \ldots, 120$, corresponding to the 10 years
beginning with January 1975.

Figure 10-3. The differenced series of Example 10.2.1 (showing also the fitted
intervention term accounting for the seat-belt legislation of 1983).

To account for the seat-belt legislation, we use the same model (6.6.15) as in
Example 6.6.3 and, because of the apparent non-stationarity of the residuals, we
again difference both $\{f_t\}$ and $\{Y_t\}$ at lag 12 to obtain the model (6.6.16), i.e.,

$$X_t = bg_t + N_t, \tag{10.2.4}$$

where $X_t = \nabla_{12}Y_t$, $g_t = \nabla_{12}f_t$, and $\{N_t\}$ is a zero-mean stationary time series. This
is a particularly simple example of the general intervention model (10.2.1) for the
series $\{X_t\}$ with intervention $\{bg_t\}$. Our aim is to find a suitable model for $\{N_t\}$ and
at the same time to estimate b, taking into account the autocorrelation function of
the model for $\{N_t\}$. To apply intervention analysis to this problem using ITSM, we
proceed as follows:


(1) Open the bivariate project SBL2.TSM and difference the series at lag 12.


(2) Select Transfer>Specify model and you will see that the default input and
noise are white noise, while the default transfer model relating the input $g_t$ to
the output $X_t$ is $X_t = bg_t$ with b = 1. Click OK, leaving these settings as they
are. The input model is irrelevant for intervention analysis, and estimation of
the transfer function with the default noise model will give us the ordinary least
squares estimate of b in the model (10.2.4), with the residuals providing estimates
of $N_t$. Now select Transfer>Estimation and click OK. You will then see the
estimated value -346.9 for b. Finally, press the red Export button (top right in
the ITSM window) to export the residuals (estimated values of $N_t$) to a file and
call it, say, NOISE.TSM.


(3) Without closing the bivariate project, open the univariate project NOISE.TSM.
The sample ACF and PACF of the series suggest either an MA(13) or AR(13)
model. Fitting AR and MA models of order up to 13 (with no mean-correction)
using the option Model>Estimation>Autofit gives an MA(12) model as the
minimum AICC fit.


(4) Return to the bivariate project by highlighting the window labeled SBL2.TSM
and select Transfer>Specify model. The transfer model will now show the
estimated value -346.9 for b. Click on the Residual Model tab, enter 12 for
the MA order and click OK. Select Transfer>Estimation and again click OK.
The parameters in both the noise and transfer models will then be estimated and
printed on the screen. Repeating the minimization with decreasing step-sizes, .1,
.01 and then .001, gives the model

$$X_t = -362.5g_t + N_t,$$

where $N_t = W_t + .207W_{t-1} + .311W_{t-2} + .105W_{t-3} + .040W_{t-4} + .194W_{t-5} +
.100W_{t-6} + .299W_{t-7} + .080W_{t-8} + .125W_{t-9} + .210W_{t-10} + .109W_{t-11} + .501W_{t-12}$,
and $\{W_t\} \sim \mathrm{WN}(0, 17289)$. File the residuals (which are now estimates of $\{W_t\}$) as
RES.TSM. The differenced series $\{X_t\}$ and the fitted intervention term, $-362.5g_t$,
are shown in Figure 10.3.


(5) Open the univariate project RES.TSM and apply the usual tests for randomness
by selecting Statistics>Residual Analysis. The tests are all passed at level
.05, leading us to conclude that the model found in step (4) is satisfactory. The
sample ACF of the residuals is shown in Figure 10.4.
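The ordinary least squares step in (2) can be sketched outside ITSM as follows. The data here are synthetic, not the SBL.TSM series: the step time `t0`, the effect size `b_true`, the seasonal pattern, and the noise level are all assumptions made purely for illustration. The sketch differences a step input and the response at lag 12 and then regresses $X_t$ on $g_t$ without an intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t0, b_true = 120, 98, -350.0           # hypothetical length, step time, and effect
t = np.arange(n)
f = (t >= t0).astype(float)               # step input f_t
seasonal = 1600.0 + 200.0 * np.cos(2 * np.pi * t / 12)
y = seasonal + b_true * f + rng.normal(0.0, 50.0, n)   # synthetic monthly series

g = f[12:] - f[:-12]                      # g_t = f_t - f_{t-12}
x = y[12:] - y[:-12]                      # X_t = Y_t - Y_{t-12}

# OLS for X_t = b * g_t + N_t, treating N_t as white noise
b_ols = np.sum(g * x) / np.sum(g * g)
residuals = x - b_ols * g                 # preliminary estimates of N_t
print(b_ols)
```

Differencing at lag 12 turns the step into a block of twelve ones, so the least squares estimate is simply the average of the differenced series over those twelve months. The residuals would then be modeled as an ARMA process, as in steps (3) and (4).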


10.3 Nonlinear Models 


A time series of the form

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \quad \{Z_t\} \sim \mathrm{IID}(0, \sigma^2), \tag{10.3.1}$$

where $Z_t$ is expressible as a mean square limit of linear combinations of $\{X_s,\ -\infty <
s \le t\}$, has the property that the best mean square predictor $E(X_{t+h} \mid X_s,\ -\infty < s \le t)$
and the best linear predictor $P_tX_{t+h}$ in terms of $\{X_s,\ -\infty < s \le t\}$ are identical. It
can be shown that if IID is replaced by WN in (10.3.1), then the two predictors are
identical if and only if $\{Z_t\}$ is a martingale difference sequence relative to $\{X_t\}$, i.e.,
if and only if $E(Z_t \mid X_s,\ -\infty < s < t) = 0$ for all t.

The Wold decomposition (Section 2.6) ensures that every purely nondeterministic
stationary process can be expressed in the form (10.3.1) with $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. The
process $\{Z_t\}$ in the Wold decomposition, however, is generally not an iid sequence,
and the best mean square predictor of $X_{t+h}$ may be quite different from the best linear
predictor.

Figure 10-4. The sample ACF of the residuals from the model in Example 10.2.1.

In the case where {X,} is a purely nondeterministic Gaussian stationary process, 
the sequence {Z,} in the Wold decomposition is Gaussian and therefore iid. Every 
stationary purely nondeterministic Gaussian process can therefore be generated by 
applying a causal linear filter to an iid Gaussian sequence. We shall therefore refer to 
such a process as a Gaussian linear process. 

In this section we shall use the term linear process to mean a process {X,} of the 
form (10.3.1). This is a more restrictive use of the term than in Definition 2.2.1. 


10.3.1 Deviations from Linearity 


Many of the time series encountered in practice exhibit characteristics not shown by 
linear processes, and so to obtain good models and predictors it is necessary to look 
to models more general than those satisfying (10.3.1) with iid noise. As indicated 
above, this will mean that the minimum mean squared error predictors are not, in 
general, linear functions of the past observations. 

Gaussian linear processes have a number of properties that are often found to
be violated by observed time series. The former are reversible in the sense that
$(X_{t_1}, \ldots, X_{t_n})'$ has the same distribution as $(X_{t_n}, \ldots, X_{t_1})'$. (Except in a few special
cases, ARMA processes are reversible if and only if they are Gaussian (Breidt and
Davis, 1992).) Deviations from this property by observed time series are suggested
by sample paths that rise to their maxima and fall away at different rates (see, for
example, the sunspot numbers filed as SUNSPOTS.TSM). Bursts of outlying values are
frequently observed in practical time series and are seen also in the sample paths of
nonlinear (and infinite-variance) models. They are rarely seen, however, in the sample
paths of Gaussian linear processes. Other characteristics suggesting deviation from
a Gaussian linear model are discussed by Tong (1990).

Figure 10-5. A sequence generated by the recursions $x_n = 4x_{n-1}(1 - x_{n-1})$.

Many observed time series, particularly financial time series, exhibit periods 
during which they are “less predictable” (or “more volatile”), depending on the past 
history of the series. This dependence of the predictability (i.e., the size of the pre- 
diction mean squared error) on the past of the series cannot be modeled with a linear 
time series, since for a linear process the minimum h-step mean squared error is 
independent of the past history. Linear models thus fail to take account of the pos- 
sibility that certain past histories may permit more accurate forecasting than others, 
and cannot identify the circumstances under which more accurate forecasts can be 
expected. Nonlinear models, on the other hand, do allow for this. The ARCH and 
GARCH models considered below are in fact constructed around the dependence of 
the conditional variance of the process on its past history. 


10.3.2 Chaotic Deterministic Sequences 


To distinguish between linear and nonlinear processes, we need to be able to decide in
particular when a white noise sequence is also iid. Sequences generated by nonlinear
deterministic difference equations can exhibit sample correlation functions that are
very close to those of samples from a white noise sequence. However, the deterministic
nature of the recursions implies the strongest possible dependence between successive
observations. For example, the celebrated logistic equation (see May, 1976, and Tong,
1990) defines a sequence $\{x_n\}$, for any given $x_0$, via the equations

$$x_n = 4x_{n-1}(1 - x_{n-1}), \quad 0 < x_0 < 1.$$

The values of $x_n$ are, for even moderately large values of n, extremely sensitive
to small changes in $x_0$. This is clear from the fact that the sequence can be expressed
explicitly as

$$x_n = \sin^2\left(2^n\arcsin(\sqrt{x_0})\right), \quad n = 0, 1, 2, \ldots.$$

A very small change $\delta$ in $\arcsin(\sqrt{x_0})$ leads to a change $2^n\delta$ in the argument of the sine
function defining $x_n$. If we generate a sequence numerically, the generated sequence
will, for most values of $x_0$ in the interval (0, 1), be random in appearance, with a
sample autocorrelation function similar to that of a sample from white noise. The
data file CHAOS.TSM contains the sequence $x_1, \ldots, x_{200}$ (correct to nine decimal
places) generated by the logistic equation with $x_0 = \pi/10$. The calculation requires
specification of $x_0$ to at least 70 decimal places and the use of correspondingly high
precision arithmetic. The series and its sample autocorrelation function are shown in
Figures 10.5 and 10.6. The sample ACF and the AICC criterion both suggest white
noise with mean .4954 as a model for the series. Under this model the best linear
predictor of $X_{201}$ would be .4954. However, the best predictor of $X_{201}$ to nine decimal
places is, in fact, $4x_{200}(1 - x_{200}) = 0.016286669$, with zero mean squared error.

Figure 10-6. The sample autocorrelation function of the sequence in Figure 10.5.

Distinguishing between iid and non-iid white noise is clearly not possible on the
basis of second-order properties. For insight into the dependence structure we can
examine sample moments of order higher than two. For example, the dependence in
the data in CHAOS.TSM is reflected by a significantly nonzero sample autocorrelation
at lag 1 of the squared data. In the following paragraphs we consider several
approaches to this problem.
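The behavior described in this section can be explored with a short simulation. The sketch below is an illustration only: it iterates the logistic equation in ordinary double-precision arithmetic, so beyond the first few dozen terms it does not reproduce the high-precision values filed in CHAOS.TSM, and the series length of 1000 is an arbitrary choice. It compares the first terms with the closed-form expression and then computes the lag-1 sample autocorrelations of the series and of its squares.

```python
import numpy as np

def logistic_orbit(x0, n):
    """Return x_0, ..., x_n from x_k = 4 x_{k-1} (1 - x_{k-1}) in double precision."""
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(1, n + 1):
        x[k] = 4.0 * x[k - 1] * (1.0 - x[k - 1])
    return x

def sample_acf(x, h):
    """Sample autocorrelation of x at lag h."""
    d = x - x.mean()
    return np.sum(d[h:] * d[:len(d) - h]) / np.sum(d * d)

x0 = np.pi / 10
x = logistic_orbit(x0, 1000)

# Closed form x_n = sin^2(2^n arcsin(sqrt(x0))), compared for small n only
closed_form = np.sin(2.0 ** np.arange(11) * np.arcsin(np.sqrt(x0))) ** 2

acf_x = sample_acf(x[1:], 1)          # lag-1 ACF of the series
acf_x2 = sample_acf(x[1:] ** 2, 1)    # lag-1 ACF of the squared series
print(acf_x, acf_x2)
```

The lag-1 autocorrelation of the series itself should be small, as for white noise, while that of the squared series should be clearly negative, which is the kind of higher-order dependence that second-order methods cannot detect.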


10.3.3 Distinguishing Between White Noise and iid Sequences 


If $\{X_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $E|X_t|^4 < \infty$, a useful tool for deciding whether or not
$\{X_t\}$ is iid is the ACF $\rho_{X^2}(h)$ of the process $\{X_t^2\}$. If $\{X_t\}$ is iid, then $\rho_{X^2}(h) = 0$ for
all $h \ne 0$, whereas this is not necessarily the case otherwise. This is the basis for the
test of McLeod and Li described in Section 1.6.

Now suppose that {X;} is a strictly stationary time series such that E|X;,|* < K < 


oo for some integer k > 3. The kth-order cumulant C;(r1, ... , Fk-1) of {X;} is then 
defined as the joint cumulant of the random variables, X,, X;4,,,---, Xir 1€., as 
the coefficient of i*z,z.--+z, in the Taylor expansion about (0, ..., 0) of 


XZ- Ze) = In Efexp(izi X, + izrXran ++ iX) (10.3.2) 


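(Since $\{X_t\}$ is strictly stationary, this quantity does not depend on $t$.) In particular,
the third-order cumulant function $C_3$ of $\{X_t\}$ coincides with the third-order central
moment function, i.e.,

$$C_3(r, s) = E[(X_t - \mu)(X_{t+r} - \mu)(X_{t+s} - \mu)], \quad r, s \in \{0, \pm 1, \ldots\},$$

where $\mu = EX_t$. If $\sum_r \sum_s |C_3(r, s)| < \infty$, we define the third-order polyspectral
density (or bispectral density) of $\{X_t\}$ to be the Fourier transform

$$f_3(\omega_1, \omega_2) = \frac{1}{(2\pi)^2} \sum_{r=-\infty}^{\infty} \sum_{s=-\infty}^{\infty} C_3(r, s)\, e^{-ir\omega_1 - is\omega_2}, \quad -\pi \le \omega_1, \omega_2 \le \pi,$$

in which case

$$C_3(r, s) = \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} e^{ir\omega_1 + is\omega_2} f_3(\omega_1, \omega_2)\, d\omega_1\, d\omega_2.$$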


[More generally, if the $k$th order cumulants $C_k(r_1, \ldots, r_{k-1})$ of $\{X_t\}$ are absolutely
summable, we define the $k$th order polyspectral density as the Fourier transform of
$C_k$. For details see Rosenblatt (1985) and Priestley (1988).]

If $\{X_t\}$ is a Gaussian linear process, it follows from Problem 10.3 that the cumulant
function $C_3$ of $\{X_t\}$ is identically zero. (The same is also true of all the cumulant
functions $C_k$ with $k > 3$.) Consequently, $f_3(\omega_1, \omega_2) = 0$ for all $\omega_1, \omega_2 \in [-\pi, \pi]$.
Appropriateness of a Gaussian linear model for a given data set can therefore be
checked by using the data to test the null hypothesis $f_3 = 0$. For details of such a
test, see Subba-Rao and Gabr (1984).

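If $\{X_t\}$ is a linear process of the form (10.3.1) with $E|Z_t|^3 < \infty$, $EZ_t^3 = \eta$, and
$\sum_{j=0}^{\infty} |\psi_j| < \infty$, it can be shown from (10.3.2) (see Problem 10.3) that the third-order
cumulant function of $\{X_t\}$ is given by

$$C_3(r, s) = \eta \sum_{i=-\infty}^{\infty} \psi_i \psi_{i+r} \psi_{i+s} \tag{10.3.3}$$

(with $\psi_j = 0$ for $j < 0$), and hence that $\{X_t\}$ has bispectral density

$$f_3(\omega_1, \omega_2) = \frac{\eta}{4\pi^2}\, \psi\!\left(e^{i(\omega_1 + \omega_2)}\right) \psi\!\left(e^{-i\omega_1}\right) \psi\!\left(e^{-i\omega_2}\right), \tag{10.3.4}$$

where $\psi(z) := \sum_{j=0}^{\infty} \psi_j z^j$. By Proposition 4.3.1, the spectral density of $\{X_t\}$ is

$$f(\omega) = \frac{\sigma^2}{2\pi} \left|\psi\!\left(e^{-i\omega}\right)\right|^2.$$

Hence,

$$\phi(\omega_1, \omega_2) := \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1) f(\omega_2) f(\omega_1 + \omega_2)} = \frac{\eta^2}{2\pi\sigma^6}.$$

Appropriateness of the linear process (10.3.1) for modeling a given data set can
therefore be checked by using the data to test for constancy of $\phi(\omega_1, \omega_2)$ (see Subba-Rao
and Gabr, 1984).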


10.3.4 Three Useful Classes of Nonlinear Models 


If it is decided that a linear Gaussian model is not appropriate, there is a choice of 
several families of nonlinear processes that have been found useful for modeling 
purposes. These include bilinear models, autoregressive models with random coeffi- 
cients, and threshold models. Excellent accounts of these are available in Subba-Rao 
and Gabr (1984), Nicholls and Quinn (1982), and Tong (1990), respectively. 

The bilinear model of order $(p, q, r, s)$ is defined by the equations

$$X_t = Z_t + \sum_{i=1}^{p} a_i X_{t-i} + \sum_{j=1}^{q} b_j Z_{t-j} + \sum_{i=1}^{r} \sum_{j=1}^{s} c_{ij} X_{t-i} Z_{t-j},$$

where $\{Z_t\} \sim \mathrm{iid}(0, \sigma^2)$. A sufficient condition for the existence of a strictly stationary
solution of these equations is given by Liu and Brockwell (1988).

A random coefficient autoregressive process $\{X_t\}$ of order $p$ satisfies an equation
of the form

$$X_t = \sum_{i=1}^{p} \left(\phi_i + U_t^{(i)}\right) X_{t-i} + Z_t,$$

where $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$, $\{U_t^{(i)}\} \sim \mathrm{IID}(0, \nu_i^2)$, $\{Z_t\}$ is independent of $\{U_t\}$, and
$\phi_1, \ldots, \phi_p \in \mathbb{R}$.

Threshold models can be regarded as piecewise linear models in which the linear
relationship varies with the values of the process. For example, if $R^{(i)}$, $i = 1, \ldots, k$,
is a partition of $\mathbb{R}^p$, and $\{Z_t\} \sim \mathrm{IID}(0, 1)$, then the $k$ difference equations

$$X_t = \sigma^{(i)} Z_t + \sum_{j=1}^{p} \phi_j^{(i)} X_{t-j}, \quad (X_{t-1}, \ldots, X_{t-p}) \in R^{(i)}, \quad i = 1, \ldots, k, \tag{10.3.5}$$

define a threshold AR(p) model. Model identification and parameter estimation for
threshold models can be carried out in a manner similar to that for linear models
using maximum likelihood and the AIC criterion.
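
As an illustration of (10.3.5), the following sketch simulates the simplest case: a threshold AR(1) model with two regimes, where the partition of the real line is $(-\infty, r]$ and $(r, \infty)$. The function name, the parameter names, and the choice of zero as the starting value are assumptions made for this example only.

```python
import random


def simulate_tar1(n, phi_low, phi_high, sigma_low=1.0, sigma_high=1.0,
                  threshold=0.0, seed=0):
    """Simulate n values of a two-regime threshold AR(1) process.

    The regime at time t is chosen by the previous value X_{t-1}:
      X_t = sigma_low  * Z_t + phi_low  * X_{t-1}   if X_{t-1} <= threshold,
      X_t = sigma_high * Z_t + phi_high * X_{t-1}   otherwise,
    with {Z_t} ~ IID N(0, 1). The recursion is started from X_0 = 0.
    """
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        if x <= threshold:
            x = sigma_low * z + phi_low * x
        else:
            x = sigma_high * z + phi_high * x
        path.append(x)
    return path


path = simulate_tar1(500, phi_low=0.6, phi_high=-0.4)
print(path[:5])
```

Setting the two regimes equal reduces the recursion to an ordinary linear AR(1) process, which gives a simple check on the code.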


10.3.5 Modeling Volatility 


For modeling changing volatility as discussed above under deviations from linearity,
Engle (1982) introduced the ARCH(p) process $\{Z_t\}$ as a solution of the equations

$$Z_t = \sqrt{h_t}\, e_t, \quad \{e_t\} \sim \mathrm{IID}\ N(0, 1), \tag{10.3.6}$$

where $h_t$ is the (positive) function of $\{Z_s, s < t\}$, defined by

$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{t-i}^2, \tag{10.3.7}$$

with $\alpha_0 > 0$ and $\alpha_j \ge 0$, $j = 1, \ldots, p$. The name ARCH signifies autoregressive
conditional heteroscedasticity. $h_t$ is the conditional variance of $Z_t$ given $\{Z_s, s < t\}$.
The simplest such process is the ARCH(1) process. In this case the recursions
(10.3.6) and (10.3.7) give

$$\begin{aligned}
Z_t^2 &= \alpha_0 e_t^2 + \alpha_1 Z_{t-1}^2 e_t^2 \\
&= \alpha_0 e_t^2 + \alpha_1 \alpha_0 e_t^2 e_{t-1}^2 + \alpha_1^2 Z_{t-2}^2 e_t^2 e_{t-1}^2 \\
&= \cdots \\
&= \alpha_0 \sum_{j=0}^{n} \alpha_1^j e_t^2 e_{t-1}^2 \cdots e_{t-j}^2 + \alpha_1^{n+1} Z_{t-n-1}^2 e_t^2 e_{t-1}^2 \cdots e_{t-n}^2.
\end{aligned}$$

If $|\alpha_1| < 1$ and $\{Z_t\}$ is stationary and causal (i.e., $Z_t$ is a function of $\{e_s, s \le t\}$),
then the last term has expectation $\alpha_1^{n+1} E Z_t^2$ and consequently (by the Borel-Cantelli
lemma) converges to zero with probability one as $n \to \infty$. The first term converges
with probability one by Proposition 3.1.1 of TSTM, and hence

$$Z_t^2 = \alpha_0 \sum_{j=0}^{\infty} \alpha_1^j e_t^2 e_{t-1}^2 \cdots e_{t-j}^2. \tag{10.3.8}$$

From (10.3.8) we immediately find that

$$E Z_t^2 = \alpha_0 / (1 - \alpha_1). \tag{10.3.9}$$

Since

$$Z_t = e_t \sqrt{\alpha_0 \left(1 + \sum_{j=1}^{\infty} \alpha_1^j e_{t-1}^2 \cdots e_{t-j}^2\right)}, \tag{10.3.10}$$

it is clear that $\{Z_t\}$ is strictly stationary and hence, since $EZ_t^2 < \infty$, also stationary
in the weak sense. We have now established the following result.


Solution of the ARCH(1) Equations:


If $|\alpha_1| < 1$, the unique causal stationary solution of the ARCH(1) equations is
given by (10.3.10). It has the properties

$$E(Z_t) = E(E(Z_t \mid e_s, s < t)) = 0,$$

$$\mathrm{Var}(Z_t) = \alpha_0 / (1 - \alpha_1),$$

and

$$E(Z_{t+h} Z_t) = E(E(Z_{t+h} Z_t \mid e_s, s < t + h)) = 0 \quad \text{for } h > 0.$$

Thus the ARCH(1) process with $|\alpha_1| < 1$ is strictly stationary white noise.
However, it is not an iid sequence, since from (10.3.6) and (10.3.7),

$$E(Z_t^2 \mid Z_{t-1}) = (\alpha_0 + \alpha_1 Z_{t-1}^2)\, E(e_t^2 \mid Z_{t-1}) = \alpha_0 + \alpha_1 Z_{t-1}^2.$$


This also shows that $\{Z_t\}$ is not Gaussian, since strictly stationary Gaussian white
noise is necessarily iid. From (10.3.10) it is clear that the distribution of $Z_t$ is symmetric,
i.e., that $Z_t$ and $-Z_t$ have the same distribution. From (10.3.8) it is easy to
calculate $E(Z_t^4)$ (Problem 10.4) and hence to show that $E(Z_t^4)$ is finite if and only
if $3\alpha_1^2 < 1$. More generally (see Engle, 1982), it can be shown that for every $\alpha_1$
in the interval $(0, 1)$, $E(Z_t^{2k}) = \infty$ for some positive integer $k$. This indicates the
"heavy-tailed" nature of the marginal distribution of $Z_t$. If $EZ_t^4 < \infty$, the squared
process $Y_t = Z_t^2$ has the same ACF as the AR(1) process $W_t = \alpha_1 W_{t-1} + e_t$, a result
that extends also to ARCH(p) processes (see Problem 10.5).

The ARCH(p) process is conditionally Gaussian, in the sense that for given values
of $\{Z_s, s = t-1, t-2, \ldots, t-p\}$, $Z_t$ is Gaussian with known distribution. This makes
it easy to write down the likelihood of $Z_{p+1}, \ldots, Z_n$ conditional on $\{Z_1, \ldots, Z_p\}$
and hence, by numerical maximization, to compute conditional maximum likelihood
estimates of the parameters. For example, the conditional likelihood of observations
$\{z_2, \ldots, z_n\}$ of the ARCH(1) process given $Z_1 = z_1$ is

$$L = \prod_{t=2}^{n} \frac{1}{\sqrt{2\pi\left(\alpha_0 + \alpha_1 z_{t-1}^2\right)}} \exp\left\{-\frac{z_t^2}{2\left(\alpha_0 + \alpha_1 z_{t-1}^2\right)}\right\}.$$
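
A minimal sketch of these ideas follows: it simulates an ARCH(1) series from (10.3.6) and (10.3.7) and evaluates $-2\ln L$ for the conditional likelihood displayed above. The function names, the burn-in length, and the seed are illustrative assumptions; the sketch is independent of ITSM.

```python
import math
import random


def simulate_arch1(n, alpha0, alpha1, burn_in=500, seed=0):
    """Simulate n values of Z_t = sqrt(h_t) e_t with h_t = alpha0 + alpha1 Z_{t-1}^2.

    The recursion starts from Z = 0 and the first burn_in values are discarded
    so that the retained values are approximately from the stationary solution.
    """
    rng = random.Random(seed)
    z_prev, out = 0.0, []
    for t in range(n + burn_in):
        h = alpha0 + alpha1 * z_prev ** 2
        z_prev = math.sqrt(h) * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            out.append(z_prev)
    return out


def arch1_neg2_loglik(z, alpha0, alpha1):
    """-2 ln L for z_2, ..., z_n given z_1 (conditional Gaussian likelihood)."""
    total = 0.0
    for t in range(1, len(z)):
        h = alpha0 + alpha1 * z[t - 1] ** 2
        total += math.log(2.0 * math.pi * h) + z[t] ** 2 / h
    return total


z = simulate_arch1(20000, alpha0=1.0, alpha1=0.5)
sample_var = sum(v * v for v in z) / len(z)
print("sample variance:", sample_var, " theoretical:", 1.0 / (1.0 - 0.5))
print("-2 ln L at (1.0, 0.5):", arch1_neg2_loglik(z, 1.0, 0.5))
print("-2 ln L at (1.0, 0.0):", arch1_neg2_loglik(z, 1.0, 0.0))
```

The sample variance should be near $\alpha_0/(1-\alpha_1) = 2$, in line with (10.3.9), and $-2\ln L$ should be smaller at the parameters used for simulation than at $\alpha_1 = 0$ (the iid Gaussian model). In practice the function would be passed to a numerical optimizer to obtain the conditional maximum likelihood estimates.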
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Example 10.3.1 An ARCH(1) series 


Figure 10.7 shows a realization of the ARCH(1) process with $\alpha_0 = 1$ and $\alpha_1 = 0.5$.
The graph of the realization and the sample autocorrelation function shown in Figure 
10.8 suggest that the process is white noise. This conclusion is correct from a second- 
order point of view. 
Figure 10-8. The sample autocorrelation function of the series in Figure 10.7.

Figure 10-9. The sample autocorrelation function of the squares of the data shown in Figure 10.7.


However, the fact that the series is not a realization of iid noise is very strongly
indicated by Figure 10.9, which shows the sample autocorrelation function of the
series $\{Z_t^2\}$. (The sample ACF of $\{|Z_t|\}$ and that of $\{Z_t^2\}$ can be plotted in ITSM by
selecting Statistics>Residual Analysis>ACF abs values/Squares.)

It is instructive to apply the Ljung-Box and McLeod-Li portmanteau tests for
white noise to this series (see Section 1.6). To do this using ITSM, open the file
ARCH.TSM, and then select Statistics>Residual Analysis>Tests of Randomness.
We find (with $h = 20$) that the Ljung-Box test and all the others except
the McLeod-Li test are passed comfortably at level .05. However, the McLeod-Li
test gives a p-value of 0 to five decimal places, clearly rejecting the hypothesis that
the series is iid.
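
The comparison of the two portmanteau tests can be imitated outside ITSM with the sketch below. It assumes the usual form of the statistic, $n(n+2)\sum_{k=1}^{h}\hat\rho^2(k)/(n-k)$, applied to the data for the Ljung-Box test and to the squared data for the McLeod-Li test, and it uses a simulated ARCH(1) series rather than the file ARCH.TSM. The function names, sample size, and seed are illustrative assumptions; the value 31.41 is the 0.95 quantile of the chi-squared distribution with 20 degrees of freedom.

```python
import math
import random


def sample_acf(xs, max_lag):
    """Sample autocorrelations at lags 1, ..., max_lag."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t + k] * dev[t] for t in range(n - k)) / n / c0
            for k in range(1, max_lag + 1)]


def portmanteau(xs, h):
    """n (n + 2) * sum_{k=1}^{h} rho_hat(k)^2 / (n - k)."""
    n = len(xs)
    rho = sample_acf(xs, h)
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(rho, start=1))


def simulate_arch1(n, alpha0, alpha1, burn_in=500, seed=1):
    """ARCH(1) simulation: Z_t = sqrt(alpha0 + alpha1 Z_{t-1}^2) e_t, e_t ~ N(0, 1)."""
    rng = random.Random(seed)
    z_prev, out = 0.0, []
    for t in range(n + burn_in):
        z_prev = math.sqrt(alpha0 + alpha1 * z_prev ** 2) * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            out.append(z_prev)
    return out


CHI2_20_95 = 31.41  # 0.95 quantile of chi-squared with 20 degrees of freedom

z = simulate_arch1(1000, alpha0=1.0, alpha1=0.5)
q_lb = portmanteau(z, 20)                   # Ljung-Box statistic of the data
q_ml = portmanteau([v * v for v in z], 20)  # McLeod-Li: same statistic for the squares
print("Ljung-Box Q:", q_lb, " McLeod-Li Q:", q_ml, " critical value:", CHI2_20_95)
```

For a series of this kind the McLeod-Li statistic should far exceed the critical value, since the squared series has substantial autocorrelation at lag 1.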


The GARCH(p, q) process (see Bollerslev, 1986) is a generalization of the
ARCH(p) process in which the variance equation (10.3.7) is replaced by

$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{t-i}^2 + \sum_{j=1}^{q} \beta_j h_{t-j}, \tag{10.3.11}$$

with $\alpha_0 > 0$ and $\alpha_j, \beta_j \ge 0$, $j = 1, 2, \ldots$.

In the analysis of empirical financial data such as percentage daily stock returns
(defined as $100 \ln(P_t / P_{t-1})$, where $P_t$ is the closing price on trading day $t$), it is usually
found that better fits to the data are obtained by relaxing the Gaussian assumption
in (10.3.6) and supposing instead that the distribution of $Z_t$ given $\{Z_s, s < t\}$ has a
heavier-tailed zero-mean distribution such as Student's t-distribution. To incorporate
such distributions we can define a general GARCH(p, q) process as a stationary
process $\{Z_t\}$ satisfying (10.3.11) and the generalized form of (10.3.6),

$$Z_t = \sqrt{h_t}\, e_t, \quad \{e_t\} \sim \mathrm{IID}(0, 1). \tag{10.3.12}$$

For modeling purposes it is usually assumed in addition that either

$$e_t \sim N(0, 1) \tag{10.3.13}$$

(as in (10.3.6)) or that

$$\sqrt{\frac{\nu}{\nu - 2}}\, e_t \sim t_\nu, \quad \nu > 2, \tag{10.3.14}$$

where $t_\nu$ denotes Student's t-distribution with $\nu$ degrees of freedom. (The scale factor
on the left of (10.3.14) is introduced to make the variance of $e_t$ equal to 1.) Other
distributions for $e_t$ can also be used.

One of the striking features of stock return data that is reflected by GARCH models
is the "persistence of volatility," or the phenomenon that large (small) fluctuations
in the data tend to be followed by fluctuations of comparable magnitude. GARCH
models reflect this by incorporating correlation in the sequence $\{h_t\}$ of conditional
variances.


Example 10.3.2 Fitting GARCH models to stock data


The top graph in Figure 10.10 shows the percentage daily returns of the Dow Jones
Industrial Index for the period July 1st, 1997, through April 9th, 1999, contained
in the file E1032.TSM. The graph suggests that there are sustained periods of both
high volatility (in October, 1997, and August, 1998) and of low volatility. The sample
autocorrelation function of this series, like that of Example 10.3.1, has very small
values; however, the sample autocorrelations of the absolute values and squares of the
data (like those in Example 10.3.1) are significantly different from zero, indicating
dependence in spite of the lack of autocorrelation. (The sample autocorrelations of
the absolute values and squares of the residuals (or of the data if no transformations
have been made and no model fitted) can be seen by clicking on the third green button
at the top of the ITSM window.) These properties suggest that an ARCH or GARCH
model might be appropriate for this series.

Figure 10-10. The daily percentage returns of the Dow Jones Industrial Index (E1032.TSM) from July 1, 1997, through April 9, 1999 (above), and the estimates of $\sigma_t = \sqrt{h_t}$ for the conditional Gaussian GARCH(1,1) model of Example 10.3.2.

The model

$$Y_t = a + Z_t, \tag{10.3.15}$$

where $\{Z_t\}$ is the GARCH(p, q) process defined by (10.3.11), (10.3.12) and (10.3.13),
can be fitted using ITSM as follows. Open the project E1032.TSM and click on the
red button labeled GAR at the top of the ITSM screen. In the resulting dialog box
enter the desired values of $p$ and $q$, e.g., 1 and 1 if you wish to fit a GARCH(1,1)
model. You may also enter initial values for the coefficients $\alpha_0, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$,
or alternatively use the default values specified by the program. Make sure that Use
normal noise is selected, click on OK and then click on the red MLE button. You will
be advised to subtract the sample mean (unless you wish to assume that the parameter
$a$ in (10.3.15) is zero). If you subtract the sample mean it will be used as the estimate of
$a$ in the model (10.3.15). The GARCH Maximum Likelihood Estimation box will
then open. When you click on OK the optimization will proceed. Denoting by $\{\tilde{Z}_t\}$
the (possibly) mean-corrected observations, the GARCH coefficients are estimated
by numerically maximizing the likelihood of $\tilde{Z}_{p+1}, \ldots, \tilde{Z}_n$ conditional on the known
values $\tilde{Z}_1, \ldots, \tilde{Z}_p$, and with assumed values 0 for each $\tilde{Z}_t$, $t \le 0$, and $\hat{\sigma}^2$ for each $h_t$,
$t \le 0$, where $\hat{\sigma}^2$ is the sample variance of $\{\tilde{Z}_1, \ldots, \tilde{Z}_n\}$. In other words the program
maximizes

$$L(\alpha_0, \ldots, \alpha_p, \beta_1, \ldots, \beta_q) = \prod_{t=p+1}^{n} \frac{1}{\sigma_t}\, \phi\!\left(\frac{\tilde{Z}_t}{\sigma_t}\right), \tag{10.3.16}$$

with respect to the coefficients $\alpha_0, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$, where $\phi$ denotes the standard
normal density, and the standard deviations $\sigma_t = \sqrt{h_t}$, $t \ge 1$, are computed
recursively from (10.3.11) with $Z_t$ replaced by $\tilde{Z}_t$, and with $\tilde{Z}_t = 0$ and $h_t = \hat{\sigma}^2$ for
$t \le 0$. To find the minimum of $-2\ln(L)$ it is advisable to repeat the optimization by
clicking on the red MLE button and then on OK several times until the result stabilizes.
It is also useful to try other initial values for $\alpha_0, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$, to minimize
the chance of finding only a local minimum of $-2\ln(L)$. Note that the optimization
is constrained so that the estimated parameters are all non-negative with

$$\hat{\alpha}_1 + \cdots + \hat{\alpha}_p + \hat{\beta}_1 + \cdots + \hat{\beta}_q < 1, \tag{10.3.17}$$

and $\hat{\alpha}_0 > 0$. Condition (10.3.17) is necessary and sufficient for the corresponding
GARCH equations to have a causal weakly stationary solution.
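
The recursion behind (10.3.16) can be sketched as follows. This is not ITSM's code: the function name and argument layout are hypothetical, and the sample variance is computed with divisor $n$, which is an assumption since the text does not specify the divisor. The function returns $-2\ln L$ together with the conditional variances $h_1, \ldots, h_n$.

```python
import math


def garch_neg2_loglik(z, alpha, beta):
    """-2 ln L of (10.3.16) for mean-corrected data z.

    alpha = [alpha_0, ..., alpha_p], beta = [beta_1, ..., beta_q].
    Presample values are set to 0 for the observations and to the sample
    variance (divisor n, an assumption) for the conditional variances.
    """
    n, p, q = len(z), len(alpha) - 1, len(beta)
    mean = sum(z) / n
    sigma2_hat = sum((v - mean) ** 2 for v in z) / n
    h, total = [], 0.0
    for t in range(n):                      # index t corresponds to time t + 1
        ht = alpha[0]
        for i in range(1, p + 1):
            past_z = z[t - i] if t - i >= 0 else 0.0
            ht += alpha[i] * past_z ** 2
        for j in range(1, q + 1):
            past_h = h[t - j] if t - j >= 0 else sigma2_hat
            ht += beta[j - 1] * past_h
        h.append(ht)
        if t >= p:                          # factors for times p + 1, ..., n only
            total += math.log(2.0 * math.pi * ht) + z[t] ** 2 / ht
    return total, h


neg2, h = garch_neg2_loglik([1.0, -1.0, 1.0, -1.0], alpha=[0.5, 0.25], beta=[0.25])
print(neg2, h)
```

A numerical optimizer would minimize the first returned value subject to the non-negativity constraints and (10.3.17); restarting from several initial values, as recommended above, guards against local minima.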
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Comparison of models with different orders p and q can be made with the aid of 
the AICC, which is defined in terms of the conditional likelihood L as 


$$\mathrm{AICC} := -2\,\frac{n}{n - p} \ln L + 2(p + q + 2)\,\frac{n}{n - p - q - 3}. \tag{10.3.18}$$

The factor $n/(n - p)$ multiplying the first term on the right has been introduced to
correct for the fact that the number of factors in (10.3.16) is only $n - p$. Notice also
that the GARCH(p, q) model has $p + q + 1$ coefficients.
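
Definition (10.3.18) translates directly into a small helper. The function name and the `extra_params` argument are hypothetical conveniences; the latter covers models with additional estimated parameters (for example the degrees of freedom of a conditional t-distribution) by replacing $q$ with $q$ plus that number.

```python
def garch_aicc(neg2_loglik, n, p, q, extra_params=0):
    """AICC of (10.3.18), given -2 ln L of the conditional likelihood.

    extra_params counts estimated parameters beyond the GARCH coefficients;
    it is added to q before the formula is applied.
    """
    k = q + extra_params
    return neg2_loglik * n / (n - p) + 2.0 * (p + k + 2) * n / (n - p - k - 3)


# Small arithmetic check: n = 10, p = q = 1, -2 ln L = 9 gives 9*10/9 + 2*4*10/5 = 26.
print(garch_aicc(9.0, n=10, p=1, q=1))
```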

The estimated mean is $\hat{a} = 0.0608$ and the minimum-AICC GARCH model
(with Gaussian noise) for the residuals, $\tilde{Z}_t = Y_t - \hat{a}$, is found to be the GARCH(1,1)
with estimated parameter values

$$\hat{\alpha}_0 = 0.1300, \quad \hat{\alpha}_1 = 0.1266, \quad \hat{\beta}_1 = 0.7922,$$

and an AICC value (defined by (10.3.18)) of 1469.02. The bottom graph in Figure
10.10 shows the corresponding estimated conditional standard deviations, $\hat{\sigma}_t$, which
clearly reflect the changing volatility of the series $\{Y_t\}$. This graph is obtained from
ITSM by clicking on the red SV (stochastic volatility) button. Under the model defined
by (10.3.11), (10.3.12), (10.3.13) and (10.3.15), the GARCH residuals, $\{\tilde{Z}_t / \hat{\sigma}_t\}$,
should be approximately IID N(0,1). A check on the independence is provided by
the sample ACF of the absolute values and squares of the residuals, which is obtained
by clicking on the fifth red button at the top of the ITSM window. These
are found to be not significantly different from zero. To check for normality, select
Garch>Garch residuals>QQ-Plot (normal). If the model is appropriate the resulting
graph should approximate a straight line through the origin with slope 1. It
is found that the deviations from the expected line are quite large for large values of
$|\tilde{Z}_t|$, suggesting the need for a heavier-tailed model, e.g., a model with conditional
t-distribution as defined by (10.3.14).

To fit the GARCH model defined by (10.3.11), (10.3.12), (10.3.14) and (10.3.15)
(i.e., with conditional t-distribution), we proceed in the same way, but with the conditional
likelihood replaced by

$$L(\alpha_0, \ldots, \alpha_p, \beta_1, \ldots, \beta_q, \nu) = \prod_{t=p+1}^{n} \frac{\sqrt{\nu}}{\sigma_t \sqrt{\nu - 2}}\; t_\nu\!\left(\frac{\tilde{Z}_t \sqrt{\nu}}{\sigma_t \sqrt{\nu - 2}}\right). \tag{10.3.19}$$

Maximization is now carried out with respect to the coefficients $\alpha_0, \ldots, \alpha_p$, $\beta_1, \ldots, \beta_q$
and the degrees of freedom $\nu$ of the t-density, $t_\nu$. The optimization can be performed
using ITSM in exactly the same way as described for the GARCH model with
Gaussian noise, except that the option Use t-distribution for noise should
be checked in each of the dialog boxes where it appears. In order to locate the minimum
of $-2\ln(L)$ it is often useful to initialize the coefficients of the model by first
fitting a GARCH model with Gaussian noise and then carrying out the optimization
using t-distributed noise.

The estimated mean is $\hat{a} = 0.0608$ as before and the minimum-AICC GARCH
model for the residuals, $\tilde{Z}_t = Y_t - \hat{a}$, is the GARCH(1,1) with estimated parameter
values

$$\hat{\alpha}_0 = 0.1324, \quad \hat{\alpha}_1 = 0.0672, \quad \hat{\beta}_1 = 0.8400, \quad \hat{\nu} = 5.714,$$

and an AICC value (as in (10.3.18) with $q$ replaced by $q + 1$) of 1437.89. Thus from
the point of view of AICC, the model with conditional t-distribution is substantially
better than the conditional Gaussian model. The sample ACF of the absolute values
and squares of the GARCH residuals are much the same as those found using Gaussian
noise, but the qq plot (obtained by clicking on the red QQ button and based on the
t-distribution with 5.714 degrees of freedom) is closer to the expected line than was
the case for the model with Gaussian noise.

There are many important and interesting theoretical questions associated with
the existence and properties of stationary solutions of the GARCH equations and their
moments and of the sampling properties of these processes. As indicated above, in
maximizing the conditional likelihood, ITSM constrains the GARCH coefficients to
be non-negative and to satisfy the condition (10.3.17) with $\hat{\alpha}_0 > 0$. These conditions
are sufficient for the process defined by the GARCH equations to be stationary. It is
frequently found in practice that the estimated values of $\alpha_1, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$
have a sum which is very close to 1. A GARCH(p,q) model with $\alpha_1 + \cdots + \alpha_p +
\beta_1 + \cdots + \beta_q = 1$ is called I-GARCH. Many generalizations of GARCH processes
(ARCH-M, E-GARCH, I-GARCH, T-GARCH, FI-GARCH, etc., as well as ARMA
models driven by GARCH noise, and regression models with GARCH errors) can
now be found in the econometrics literature.

ITSM can be used to fit ARMA and regression models with GARCH noise by 
using the procedures described in Example 10.3.2 to fit a GARCH model to the 
residuals {Z,} from the ARMA (or regression) fit. 


Fitting ARMA models driven by GARCH noise 


If we open the data file SUNSPOTS.TSM, subtract the mean and use the option
Model>Estimation>Autofit with the default ranges for $p$ and $q$, we obtain an
ARMA(3,4) model for the mean-corrected data. Clicking on the second green button
at the top of the ITSM window, we see that the sample ACF of the ARMA residuals
is compatible with iid noise. However, the sample autocorrelation functions of the
absolute values and squares of the residuals (obtained by clicking on the third green
button) indicate that they are not independent. To fit a Gaussian GARCH(1,1) model
to the ARMA residuals click on the red GAR button, enter the value 1 for both $p$ and
$q$ and click OK. Then click on the red MLE button, click OK in the dialog box, and the
GARCH ML Estimates window will open, showing the estimated parameter values.
Repeat the steps in the previous sentence two more times and the window will display
the following ARMA(3,4) model for the mean-corrected sunspot data and the fitted
GARCH model for the ARMA noise process $\{Z_t\}$.

$$X_t = 2.463X_{t-1} - 2.248X_{t-2} + .757X_{t-3} + Z_t - .948Z_{t-1} - .296Z_{t-2} + .313Z_{t-3} + .136Z_{t-4},$$

where

$$Z_t = \sqrt{h_t}\, e_t$$

and

$$h_t = 31.152 + .223Z_{t-1}^2 + .596h_{t-1}.$$


The AICC value for the GARCH fit (805.12) should be used for comparing alternative 
GARCH models for the ARMA residuals. The AICC value adjusted for the ARMA 
fit (821.70) should be used for comparison with alternative ARMA models (with 
or without GARCH noise). Standard errors of the estimated coefficients are also 
displayed. 

Simulation using the fitted ARMA(3,4) model with GARCH(1,1) noise can be
carried out by selecting Garch>Simulate Garch process. If you retain the settings 
in the ARMA Simulation dialog box and click OK you will see a simulated realization 
of the model for the original data in SUNSPOTS.TSM. 


Some useful references for extensions and further properties of GARCH models 
are Weiss (1986), Engle (1995), Shephard (1996), and Gouriéroux (1997). 


10.4 Continuous-Time Models 


Discrete time series are often obtained by observing a continuous-time process at a 
discrete sequence of observation times. It is then natural, even though the observa- 
tions are made at discrete times, to model the underlying process as a continuous-time 
series. Even if there is no underlying continuous-time process, it may still be advanta- 
geous to model the data as observations of a continuous-time process at discrete times. 
The analysis of time series data observed at irregularly spaced times can be handled 
very conveniently via continuous-time models, as pointed out by Jones (1980). 

Continuous-time ARMA processes are defined in terms of stochastic differential 
equations analogous to the difference equations that are used to define discrete-time 
ARMA processes. Here we shall confine attention to the continuous-time AR(1) pro- 
cess, which is defined as a stationary solution of the first-order stochastic differential 
equation 


$$DX(t) + aX(t) = \sigma DB(t) + b, \tag{10.4.1}$$

where the operator $D$ denotes differentiation with respect to $t$, $\{B(t)\}$ is standard
Brownian motion, and $a$, $b$, and $\sigma$ are parameters. The derivative $DB(t)$ does not
exist in the usual sense, so equation (10.4.1) is interpreted as an Itô differential
equation

$$dX(t) + aX(t)\,dt = \sigma\,dB(t) + b\,dt, \quad t > 0, \tag{10.4.2}$$

with $dX(t)$ and $dB(t)$ denoting the increments of $X$ and $B$ in the time interval $(t, t + dt)$
and $X(0)$ a random variable with finite variance, independent of $\{B(t)\}$ (see, e.g.,
Chung and Williams, 1990, Karatzas and Shreve, 1991, and Oksendal, 1992). The
solution of (10.4.2) can be written as

$$X(t) = e^{-at} X(0) + \sigma \int_0^t e^{-a(t-u)}\,dB(u) + b \int_0^t e^{-a(t-u)}\,du,$$

or equivalently,

$$X(t) = e^{-at} X(0) + e^{-at} I(t) + b\,e^{-at} \int_0^t e^{au}\,du, \tag{10.4.3}$$

where $I(t) = \sigma \int_0^t e^{au}\,dB(u)$ is an Itô integral (see Chung and Williams, 1990) satisfying
$E(I(t)) = 0$ and $\mathrm{Cov}(I(t + h), I(t)) = \sigma^2 \int_0^t e^{2au}\,du$ for all $t \ge 0$ and $h \ge 0$.

If $a > 0$ and $X(0)$ has mean $b/a$ and variance $\sigma^2/(2a)$, it is easy to check
(Problem 10.9) that $\{X(t)\}$ as defined by (10.4.3) is stationary with

$$E(X(t)) = \frac{b}{a} \quad \text{and} \quad \mathrm{Cov}(X(t + h), X(t)) = \frac{\sigma^2}{2a}\, e^{-ah}, \quad t, h \ge 0. \tag{10.4.4}$$

Conversely, if $\{X(t)\}$ is stationary, then by equating the variances of both sides of
(10.4.3), we find that $(1 - e^{-2at})\,\mathrm{Var}(X(0)) = \sigma^2 \int_0^t e^{-2au}\,du$ for all $t \ge 0$, and hence
that $a > 0$ and $\mathrm{Var}(X(0)) = \sigma^2/(2a)$. Equating the means of both sides of (10.4.3)
then gives $E(X(0)) = b/a$. Necessary and sufficient conditions for $\{X(t)\}$ to be
stationary are therefore $a > 0$, $E(X(0)) = b/a$, and $\mathrm{Var}(X(0)) = \sigma^2/(2a)$. If $a > 0$
and $X(0)$ is $N(b/a, \sigma^2/(2a))$, then the CAR(1) process will also be Gaussian and
strictly stationary.
If $a > 0$ and $0 \le s \le t$, it follows from (10.4.3) that $X(t)$ can be expressed as

$$X(t) = e^{-a(t-s)} X(s) + \frac{b}{a}\left(1 - e^{-a(t-s)}\right) + e^{-at}\left(I(t) - I(s)\right). \tag{10.4.5}$$

This shows that the process is Markovian, i.e., that the distribution of $X(t)$ given
$X(u)$, $u \le s$, is the same as the distribution of $X(t)$ given $X(s)$. It also shows that the
conditional mean and variance of $X(t)$ given $X(s)$ are

$$E(X(t) \mid X(s)) = e^{-a(t-s)} X(s) + \frac{b}{a}\left(1 - e^{-a(t-s)}\right)$$

and

$$\mathrm{Var}(X(t) \mid X(s)) = \frac{\sigma^2}{2a}\left[1 - e^{-2a(t-s)}\right].$$

We can now use the Markov property and the moments of the stationary distribution
to write down the Gaussian likelihood of observations $x(t_1), \ldots, x(t_n)$ at times
$t_1, \ldots, t_n$ of a CAR(1) process satisfying (10.4.1). This is just the joint density of
$(X(t_1), \ldots, X(t_n))'$ at $(x(t_1), \ldots, x(t_n))'$, which can be expressed as the product of the
stationary density at $x(t_1)$ and the transition densities of $X(t_i)$ given $X(t_{i-1}) = x(t_{i-1})$,
$i = 2, \ldots, n$. The joint density $g$ is therefore given by

$$g\left(x(t_1), \ldots, x(t_n); a, b, \sigma^2\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{v_i}}\, f\!\left(\frac{x(t_i) - m_i}{\sqrt{v_i}}\right), \tag{10.4.6}$$

where $f(x) = n(x; 0, 1)$ is the standard normal density, $m_1 = b/a$, $v_1 = \sigma^2/(2a)$,
and for $i > 1$,

$$m_i = e^{-a(t_i - t_{i-1})} x(t_{i-1}) + \frac{b}{a}\left(1 - e^{-a(t_i - t_{i-1})}\right)$$

and

$$v_i = \frac{\sigma^2}{2a}\left[1 - e^{-2a(t_i - t_{i-1})}\right].$$

The maximum likelihood estimators of $a$, $b$, and $\sigma^2$ are the values that maximize
$g(x(t_1), \ldots, x(t_n); a, b, \sigma^2)$. These can be found with the aid of a nonlinear
maximization algorithm. Notice that the times $t_i$ appearing in (10.4.6) are quite arbitrarily
spaced. It is this feature that makes the CAR(1) process so useful for modeling
irregularly spaced data.

If the observations are regularly spaced, say $t_i = i$, $i = 1, \ldots, n$, then the joint
density $g$ is exactly the same as the joint density of observations of the discrete-time
Gaussian AR(1) process

$$Y_n - \frac{b}{a} = e^{-a}\left(Y_{n-1} - \frac{b}{a}\right) + Z_n, \quad \{Z_t\} \sim \mathrm{WN}\!\left(0, \frac{\sigma^2\left(1 - e^{-2a}\right)}{2a}\right).$$



This shows that the "embedded" discrete-time process $\{X(i), i = 1, 2, \ldots\}$ of the
CAR(1) process is a discrete-time AR(1) process with coefficient $e^{-a}$. This coefficient
is clearly positive, immediately raising the question of whether there is a continuous-time
ARMA process for which the embedded process is a discrete-time AR(1) process
with negative coefficient. It can be shown (Chan and Tong, 1987) that the answer is
yes and that given a discrete-time AR(1) process with negative coefficient, it can
always be embedded in a suitably chosen continuous-time ARMA(2,1) process.
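
The Gaussian likelihood (10.4.6) for irregularly spaced observations can be evaluated with a few lines of code. The sketch below returns $-2\ln g$; the function name and argument order are hypothetical, and the observation times are assumed to be strictly increasing.

```python
import math


def car1_neg2_loglik(times, xs, a, b, sigma2):
    """-2 ln g of (10.4.6) for a stationary Gaussian CAR(1) process.

    times must be strictly increasing; xs[i] is the observation at times[i].
    """
    if a <= 0.0 or sigma2 <= 0.0:
        raise ValueError("stationarity requires a > 0 and sigma2 > 0")
    mu = b / a                       # stationary mean
    stat_var = sigma2 / (2.0 * a)    # stationary variance
    total = 0.0
    for i, x in enumerate(xs):
        if i == 0:
            m, v = mu, stat_var      # stationary density for the first observation
        else:
            decay = math.exp(-a * (times[i] - times[i - 1]))
            m = decay * xs[i - 1] + mu * (1.0 - decay)
            v = stat_var * (1.0 - decay ** 2)
        total += math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
    return total


# Irregularly spaced observation times are handled without any change.
print(car1_neg2_loglik([0.0, 0.4, 1.7, 2.0], [1.8, 2.3, 1.9, 2.4], a=1.0, b=2.0, sigma2=2.0))
```

With unit spacing the factor `decay` equals $e^{-a}$, the coefficient of the embedded discrete-time AR(1) process described above.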

We define a zero-mean CARMA(p, q) process $\{Y(t)\}$ (with $0 \le q < p$) to be a
stationary solution of the $p$th-order linear differential equation

$$D^p Y(t) + a_1 D^{p-1} Y(t) + \cdots + a_p Y(t) = b_0 DB(t) + b_1 D^2 B(t) + \cdots + b_q D^{q+1} B(t), \tag{10.4.7}$$

where $D^{j}$ denotes $j$-fold differentiation with respect to $t$, $\{B(t)\}$ is standard Brownian
motion, and $a_1, \ldots, a_p$, $b_0, \ldots, b_q$, and $c$ are constants. We assume that $b_q \neq 0$
and define $b_j := 0$ for $j > q$. Since the derivatives $D^j B(t)$, $j > 0$, do not exist in
the usual sense, we interpret (10.4.7) as being equivalent to the observation and state
equations

$$Y(t) = \mathbf{b}'\mathbf{X}(t), \quad t \ge 0, \tag{10.4.8}$$

and

$$d\mathbf{X}(t) = A\mathbf{X}(t)\,dt + \mathbf{e}\,dB(t), \tag{10.4.9}$$

where

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -a_p & -a_{p-1} & -a_{p-2} & \cdots & -a_1 \end{bmatrix},$$

$\mathbf{e} = [0\ \ 0\ \ \cdots\ \ 0\ \ 1]'$, $\mathbf{b} = [b_0\ \ b_1\ \ \cdots\ \ b_{p-2}\ \ b_{p-1}]'$, and (10.4.9) is an Itô
differential equation for the state vector $\mathbf{X}(t)$. (We assume also that $\mathbf{X}(0)$ is independent
of $\{B(t)\}$.)
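The solution of (10.4.9) can be written as

$$\mathbf{X}(t) = e^{At}\mathbf{X}(0) + \int_0^t e^{A(t-u)}\,\mathbf{e}\,dB(u),$$

which is stationary if and only if

$$E\mathbf{X}(0) = [0\ \ 0\ \ \cdots\ \ 0]',$$

$$\mathrm{Cov}(\mathbf{X}(0)) = \int_0^{\infty} e^{Ay}\,\mathbf{e}\,\mathbf{e}'\,e^{A'y}\,dy,$$

and all the eigenvalues of $A$ (i.e., the roots of $z^p + a_1 z^{p-1} + \cdots + a_p = 0$) have
negative real parts.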

Then $\{Y(t), t \ge 0\}$ is said to be a zero-mean CARMA(p, q) process with parameters
$(a_1, \ldots, a_p, b_0, \ldots, b_q, \sigma, c)$ if

$$Y(t) = [b_0\ \ b_1\ \ \cdots\ \ b_{p-2}\ \ b_{p-1}]\,\mathbf{X}(t),$$

where $\{\mathbf{X}(t)\}$ is a stationary solution of (10.4.9) and $b_j := 0$ for $j > q$.
The autocovariance function of the process $\mathbf{X}(t)$ at lag $h$ is easily found to be

$$\mathrm{Cov}(\mathbf{X}(t + h), \mathbf{X}(t)) = e^{Ah}\Sigma, \quad h \ge 0,$$

where

$$\Sigma := \int_0^{\infty} e^{Ay}\,\mathbf{e}\,\mathbf{e}'\,e^{A'y}\,dy.$$

The mean and autocovariance function of the CARMA(p, q) process $\{Y(t)\}$ are therefore
given by

$$EY(t) = 0$$

and

$$\mathrm{Cov}(Y(t + h), Y(t)) = \mathbf{b}' e^{Ah} \Sigma\,\mathbf{b}.$$

Inference for continuous-time ARMA processes is more complicated than for 
continuous-time AR(1) processes because higher-order processes are not Marko- 
vian, so the simple calculation that led to (10.4.6) must be modified. However, the 
likelihood of observations at times t,,..., t, can still easily be computed using the 
discrete-time Kalman recursions (see Jones, 1980). 

Continuous-time ARMA processes with thresholds constitute a useful class of 
nonlinear time series models. For example, the continuous-time threshold AR(1) 
process with threshold at r is defined as a solution of the stochastic differential 
equations 


$$dX(t) + a_1 X(t)\,dt = b_1\,dt + \sigma_1\,dB(t), \quad X(t) \le r,$$

and

$$dX(t) + a_2 X(t)\,dt = b_2\,dt + \sigma_2\,dB(t), \quad X(t) > r.$$


For a detailed discussion of such processes, see Stramer, Brockwell, and Tweedie 
(1996). Continuous-time threshold ARMA processes are discussed in Brockwell 
(1994) and non-Gaussian CARMA(p, q) processes in Brockwell (2001). For more 
on continuous-time models see Bergstrom (1990) and Harvey (1990). 


10.5 Long-Memory Models 


The autocorrelation function $\rho(\cdot)$ of an ARMA process at lag $h$ converges rapidly to
zero as $h \to \infty$ in the sense that there exists $r > 1$ such that

$$r^h \rho(h) \to 0 \quad \text{as } h \to \infty. \tag{10.5.1}$$

Stationary processes with much more slowly decreasing autocorrelation function,
known as fractionally integrated ARMA processes, or more precisely as ARIMA
(p, d, q) processes with $0 < |d| < 0.5$, satisfy difference equations of the form

$$(1 - B)^d \phi(B) X_t = \theta(B) Z_t, \tag{10.5.2}$$

where $\phi(z)$ and $\theta(z)$ are polynomials of degrees $p$ and $q$, respectively, satisfying

$$\phi(z) \neq 0 \text{ and } \theta(z) \neq 0 \text{ for all } z \text{ such that } |z| \le 1,$$

$B$ is the backward shift operator, and $\{Z_t\}$ is a white noise sequence with mean 0 and
variance $\sigma^2$. The operator $(1 - B)^d$ is defined by the binomial expansion

$$(1 - B)^d = \sum_{j=0}^{\infty} \pi_j B^j,$$

where $\pi_0 = 1$ and

$$\pi_j = \prod_{0 < k \le j} \frac{k - 1 - d}{k}, \quad j = 1, 2, \ldots.$$

The autocorrelation $\rho(h)$ at lag $h$ of an ARIMA (p, d, q) process with $0 < |d| < 0.5$
has the property

$$\rho(h)\,h^{1-2d} \to c \quad \text{as } h \to \infty. \tag{10.5.3}$$


This implies (see (10.5.1)) that $\rho(h)$ converges to zero as $h \to \infty$ at a much slower
rate than $\rho(h)$ for an ARMA process. Consequently, fractionally integrated ARMA
processes are said to have "long memory." In contrast, stationary processes whose
ACF converges to 0 rapidly, such as ARMA processes, are said to have "short memory."
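
The coefficients $\pi_j$ in the binomial expansion of $(1 - B)^d$ are easily computed from the recursion $\pi_j = \pi_{j-1}(j - 1 - d)/j$, which is just the product formula taken one factor at a time. The function name below is a hypothetical choice for this sketch.

```python
def frac_diff_coeffs(d, n_terms):
    """Return [pi_0, ..., pi_{n_terms - 1}] in (1 - B)^d = sum_j pi_j B^j."""
    coeffs = [1.0]
    for j in range(1, n_terms):
        coeffs.append(coeffs[-1] * (j - 1 - d) / j)
    return coeffs


print(frac_diff_coeffs(1.0, 4))   # ordinary differencing: 1 - B
print(frac_diff_coeffs(0.4, 6))   # slowly decaying coefficients for fractional d
```

For $d = 1$ the expansion terminates after two terms, recovering the usual differencing operator, whereas for fractional $d$ the coefficients decay slowly and never vanish.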

A fractionally integrated ARIMA(p, d, q) process can be regarded as an ARMA
(p,q) process driven by fractionally integrated noise; i.e., we can replace equation
(10.5.2) by the two equations

$$\phi(B) X_t = \theta(B) W_t \tag{10.5.4}$$

and

$$(1 - B)^d W_t = Z_t. \tag{10.5.5}$$

The process $\{W_t\}$ is called fractionally integrated white noise and can be shown
(see, e.g., TSTM, Section 13.2) to have variance and autocorrelations given by

$$\gamma_W(0) = \sigma^2\, \frac{\Gamma(1 - 2d)}{\Gamma^2(1 - d)} \tag{10.5.6}$$

and

$$\rho_W(h) = \frac{\Gamma(h + d)\,\Gamma(1 - d)}{\Gamma(h - d + 1)\,\Gamma(d)} = \prod_{0 < k \le h} \frac{k - 1 + d}{k - d}, \quad h = 1, 2, \ldots, \tag{10.5.7}$$

where $\Gamma(\cdot)$ is the gamma function (see Example (d) of Section A.1). The exact
autocovariance function of the ARIMA (p, d, q) process $\{X_t\}$ defined by (10.5.2) can
therefore be expressed, by Proposition 2.2.1, as

$$\gamma_X(h) = \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} \psi_j \psi_k\, \gamma_W(h + j - k), \tag{10.5.8}$$

where $\sum_{i=0}^{\infty} \psi_i z^i = \theta(z)/\phi(z)$, $|z| \le 1$, and $\gamma_W(\cdot)$ is the autocovariance function of
fractionally integrated white noise with parameters $d$ and $\sigma^2$, i.e.,

$$\gamma_W(h) = \gamma_W(0)\,\rho_W(h),$$

with $\gamma_W(0)$ and $\rho_W(h)$ as in (10.5.6) and (10.5.7). The series (10.5.8) converges
rapidly as long as $\phi(z)$ does not have zeros with absolute value close to 1.
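
Formulas (10.5.6) and (10.5.7) can be checked numerically with the sketch below, which computes the autocorrelations of fractionally integrated white noise from the product form and the variance from the gamma-function form. The function names are hypothetical; the final lines illustrate the slow decay described by (10.5.3).

```python
import math


def frac_noise_acf(d, max_lag):
    """[rho_W(0), ..., rho_W(max_lag)] via the product form of (10.5.7)."""
    rho = [1.0]
    for h in range(1, max_lag + 1):
        rho.append(rho[-1] * (h - 1 + d) / (h - d))
    return rho


def frac_noise_variance(d, sigma2=1.0):
    """gamma_W(0) from (10.5.6); requires d < 0.5."""
    return sigma2 * math.gamma(1.0 - 2.0 * d) / math.gamma(1.0 - d) ** 2


d = 0.3
rho = frac_noise_acf(d, 2000)
print("rho_W(1) =", rho[1], " (equals d / (1 - d))")
print("gamma_W(0) =", frac_noise_variance(d))
# rho_W(h) * h^(1 - 2d) should be nearly the same at large lags.
print(rho[1000] * 1000 ** (1 - 2 * d), rho[2000] * 2000 ** (1 - 2 * d))
```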
The spectral density of $\{X_t\}$ is given by

$$f(\lambda) = \frac{\sigma^2}{2\pi}\, \frac{\left|\theta\!\left(e^{-i\lambda}\right)\right|^2}{\left|\phi\!\left(e^{-i\lambda}\right)\right|^2}\, \left|1 - e^{-i\lambda}\right|^{-2d}. \tag{10.5.9}$$

Calculation of the exact Gaussian likelihood of observations $\{x_1, \ldots, x_n\}$ of a fractionally
integrated ARMA process is very slow and demanding in terms of computer
memory. Instead of estimating the parameters $d$, $\phi_1, \ldots, \phi_p$, $\theta_1, \ldots, \theta_q$, and $\sigma^2$ by
maximizing the exact Gaussian likelihood, it is much simpler to maximize the Whittle
approximation $L_W$, defined by

$$-2\ln(L_W) = n\ln(2\pi) + 2n\ln\sigma + \sigma^{-2} \sum_j \frac{I_n(\omega_j)}{g(\omega_j)} + \sum_j \ln g(\omega_j), \tag{10.5.10}$$

where $I_n$ is the periodogram, $\sigma^2 g/(2\pi)$ $(= f)$ is the model spectral density, and $\sum_j$
denotes the sum over all nonzero Fourier frequencies $\omega_j = 2\pi j/n \in (-\pi, \pi]$. The
program ITSM estimates parameters for ARIMA(p, d, q) models in this way. It can
also be used to predict and simulate fractionally integrated ARMA series and to
compute the autocovariance function of any specified fractionally integrated ARMA
model.


Example 10.5.1 Annual Minimum Water Levels; NILE.TSM


The data file NILE.TSM consists of the annual minimum water levels of the Nile
river as measured at the Roda gauge near Cairo for the years 622-871. These values
are plotted in Figure 10.11 with the corresponding sample autocorrelations shown in
Figure 10.12. The rather slow decay of the sample autocorrelation function suggests
the possibility of a fractionally integrated model for the mean-corrected series $Y_t =
X_t - 1119$.

The ARMA model with minimum (exact) AICC value for the mean-corrected
series $\{Y_t\}$ is found, using Model>Estimation>Autofit, to be

$$\begin{aligned}
Y_t = {} & -.323Y_{t-1} - .060Y_{t-2} + .633Y_{t-3} + .069Y_{t-4} + .248Y_{t-5} \\
& + Z_t + .702Z_{t-1} + .350Z_{t-2} - .419Z_{t-3},
\end{aligned} \tag{10.5.11}$$

with $\{Z_t\} \sim \mathrm{WN}(0, 5663.6)$ and AICC = 2889.9.

To fit a fractionally integrated ARMA model to this series, select the option 
Model>Specify, check the box marked Fractionally integrated model, and 
click on OK. Then select Model>Estimation>Autofit, and click on Start. This 
estimation procedure is relatively slow so the specified ranges for p and q should 
be small (the default is from 0 to 2). When models have been fitted for each value 
of (p,q), the fractionally integrated model with the smallest modified AIC value is 
found to be 


$$(1 - B)^{.3830}(1 - .1694B + .9704B^2)Y_t = (1 - .1800B + .9278B^2)Z_t, \tag{10.5.12}$$

with $\{Z_t\} \sim \mathrm{WN}(0, 5827.4)$ and modified AIC = 2884.94. (The modified AIC statistic
for estimating the parameters of a fractionally integrated ARMA($p, q$) process is
defined in terms of the Whittle likelihood $L_W$ as $-2\ln L_W + 2(p+q+2)$ if $d$ is
estimated, and $-2\ln L_W + 2(p+q+1)$ otherwise. The Whittle likelihood was defined
in (10.5.10).)

Figure 10.11. Annual minimum water levels of the Nile river for the years 622-871.

Figure 10.12. The sample correlation function of the data in Figure 10.11.


Figure 10.13. The minimum annual Nile river levels for the years 821-871, with
20 forecasts based on the model (10.5.12).

In order to compare the models (10.5.11) and (10.5.12), the modified AIC value
for (10.5.11) is found as follows. After fitting the model as described above, select
Model>Specify, check the box marked Fractionally integrated model, set
$d = 0$ and click on OK. The next step is to choose Model>Estimation>Max likelihood,
check No optimization and click on OK. You will then see the modified
AIC value, 2884.58, displayed in the ML estimates window together with the value
2866.58 of $-2\ln L_W$.
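The Whittle criterion (10.5.10) and the modified AIC can be sketched in a few lines. This is an illustration under stated assumptions, not ITSM's code: the function names are hypothetical, the periodogram is taken to be $I_n(\omega) = n^{-1}|\sum_{t=1}^{n} x_t e^{-it\omega}|^2$, and the polynomials follow the convention $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$, $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$.

```python
import cmath
import math

def periodogram(x, j):
    """I_n(omega_j) = |sum_t x_t exp(-i t omega_j)|^2 / n, omega_j = 2*pi*j/n."""
    n = len(x)
    w = 2 * math.pi * j / n
    s = sum(xt * cmath.exp(-1j * w * t) for t, xt in enumerate(x, start=1))
    return abs(s) ** 2 / n

def farima_g(w, phi, theta, d):
    """g(w) such that sigma^2 * g(w) / (2*pi) is the spectral density (10.5.9)."""
    z = cmath.exp(-1j * w)
    ar = 1 - sum(c * z ** (k + 1) for k, c in enumerate(phi))
    ma = 1 + sum(c * z ** (k + 1) for k, c in enumerate(theta))
    return abs(ma) ** 2 / abs(ar) ** 2 * abs(1 - z) ** (-2 * d)

def whittle_neg2loglik(x, g, sigma2):
    """Right-hand side of (10.5.10); g is a function of frequency."""
    n = len(x)
    # indices of the nonzero Fourier frequencies 2*pi*j/n in (-pi, pi]
    js = [j for j in range(-((n - 1) // 2), n // 2 + 1) if j != 0]
    total = n * math.log(2 * math.pi) + n * math.log(sigma2)  # 2n ln(sigma) = n ln(sigma^2)
    for j in js:
        w = 2 * math.pi * j / n
        total += periodogram(x, j) / (sigma2 * g(w)) + math.log(g(w))
    return total

def modified_aic(neg2lnlw, p, q, d_estimated):
    """-2 ln L_W + 2(p+q+2) if d is estimated, -2 ln L_W + 2(p+q+1) otherwise."""
    return neg2lnlw + 2 * (p + q + (2 if d_estimated else 1))

# ARMA(5,3) model (10.5.11) with d fixed at 0, using the value 2866.58 from the text.
print(round(modified_aic(2866.58, 5, 3, d_estimated=False), 2))  # 2884.58
```

The last line reproduces the arithmetic behind the value 2884.58 quoted above: the penalty is $2(5 + 3 + 1) = 18$. A direct evaluation of the periodogram costs $O(n^2)$ operations; a fast Fourier transform would normally be used for long series.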

The ARMA(5,3) model is slightly better in terms of modified AIC than the frac- 
tionally integrated model and its ACF is closer to the sample ACF of the data than 
is the ACF of the fractionally integrated model. (The sample and model autocorrela- 
tion functions can be compared by clicking on the third yellow button at the top of 
the ITSM window.) The residuals from both models pass all of the ITSM tests for 
randomness. 

Figure 10.13 shows the graph of $\{x_{200}, \ldots, x_{250}\}$ with predictors of the next 20
values obtained from the model (10.5.12) for the mean-corrected series.


Problems

10.1. Find a transfer function model relating the input and output series $X_{t1}$ and $X_{t2}$,
$t = 1, \ldots, 200$, contained in the ITSM data files APPJ.TSM and APPK.TSM,
respectively. Use the fitted model to predict $X_{201,2}$, $X_{202,2}$, and $X_{203,2}$. Compare
the predictors and their mean squared errors with the corresponding predictors
and mean squared errors obtained by modeling $\{X_{t2}\}$ as a univariate ARMA
process and with the results of Problem 7.7.

10.2. Verify the calculations of Example 10.2.1 to fit an intervention model to the
series SB.TSM.

10.3. If $\{X_t\}$ is the linear process (10.3.1) with $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$ and $\eta = EZ_t^3$, show
that the third-order cumulant function of $\{X_t\}$ is given by

$$C_3(r, s) = \eta\sum_{i=-\infty}^{\infty}\psi_i\psi_{i+r}\psi_{i+s}.$$

Use this result to establish equation (10.3.4). Conclude that if $\{X_t\}$ is a Gaussian
linear process, then $C_3(r, s) \equiv 0$ and $f_3(\omega_1, \omega_2) \equiv 0$.

10.4. Evaluate $EZ_t^4$ for the ARCH(1) process (10.3.10) with $0 < \alpha_1 < 1$ and $\{e_t\} \sim$
IID N(0, 1). Deduce that $EX_t^4 < \infty$ if and only if $3\alpha_1^2 < 1$.

10.5. Let $\{Z_t\}$ be a causal stationary solution of the ARCH($p$) equations (10.3.6)
and (10.3.7) with $EZ_t^4 < \infty$. Assuming that such a process exists, show that
$Y_t = Z_t^2/\alpha_0$ satisfies the equations

$$Y_t = e_t^2\Big(1 + \sum_{i=1}^{p}\alpha_i Y_{t-i}\Big)$$

and deduce that $\{Y_t\}$ has the same autocorrelation function as the AR($p$) process

$$W_t = \sum_{i=1}^{p}\alpha_i W_{t-i} + e_t, \quad \{e_t\} \sim \mathrm{WN}(0, 1).$$

(In the case $p = 1$, a necessary and sufficient condition for existence of a causal
stationary solution of (10.3.6) and (10.3.7) with $EZ_t^4 < \infty$ is $3\alpha_1^2 < 1$, as shown
by the results of Section 10.3 and Problem 10.4.)

10.6. Suppose that $\{Z_t\}$ is a causal stationary GARCH($p, q$) process $Z_t = \sqrt{h_t}\,e_t$,
where $\{e_t\} \sim \mathrm{IID}(0, 1)$, $\sum_{i=1}^{p}\alpha_i + \sum_{j=1}^{q}\beta_j < 1$ and

$$h_t = \alpha_0 + \alpha_1 Z_{t-1}^2 + \cdots + \alpha_p Z_{t-p}^2 + \beta_1 h_{t-1} + \cdots + \beta_q h_{t-q}.$$

a. Show that $E(Z_t^2 \mid Z_{t-1}^2, Z_{t-2}^2, \ldots) = h_t$.

b. Show that the squared process $\{Z_t^2\}$ is an ARMA($m, q$) process satisfying
the equations

$$Z_t^2 = \alpha_0 + (\alpha_1 + \beta_1)Z_{t-1}^2 + \cdots + (\alpha_m + \beta_m)Z_{t-m}^2 + U_t - \beta_1 U_{t-1} - \cdots - \beta_q U_{t-q},$$

where $m = \max\{p, q\}$, $\alpha_j = 0$ for $j > p$, $\beta_j = 0$ for $j > q$, and $U_t = Z_t^2 - h_t$
is white noise if $EZ_t^4 < \infty$.

c. For $p > 1$, show that the conditional variance process $\{h_t\}$ is an
ARMA($m, p-1$) process satisfying the equations

$$h_t = \alpha_0 + (\alpha_1 + \beta_1)h_{t-1} + \cdots + (\alpha_m + \beta_m)h_{t-m} + V_t + \alpha_1^* V_{t-1} + \cdots + \alpha_{p-1}^* V_{t-p+1},$$

where $V_t = \alpha_1 U_{t-1}$ and $\alpha_j^* = \alpha_{j+1}/\alpha_1$ for $j = 1, \ldots, p-1$.


10.7. To each of the seven components of the multivariate time series filed as 
STOCK7.TSM, fit an ARMA model driven by GARCH noise. Compare the 
fitted models for the various series and comment on the differences. (For ex- 
porting components of a multivariate time series to a univariate project, see the 
ITSM Help topic, Project editor.) 


10.8. If $a > 0$ and $X(0)$ has mean $b/a$ and variance $\sigma^2/(2a)$, show that the process
defined by (10.4.3) is stationary and evaluate its mean and autocovariance
function.


10.9. a. Fit a fractionally integrated ARMA model to the first 230 tree-ring widths 
contained in the file TRINGS.TSM. Use the model to generate forecasts and
95% prediction bounds for the last 20 observations (corresponding to t = 
231,...,250) and plot the entire data set with the forecasts and prediction 
bounds superposed on the graph of the data. 


b. Repeat part (a), but this time fitting an appropriate ARMA model. Compare 
the performance of the two sets of predictors. 
Random Variables and
Probability Distributions


A.1 Distribution Functions and Expectation 
A.2 Random Vectors 
A.3 The Multivariate Normal Distribution 


A.1 Distribution Functions and Expectation 


The distribution function $F$ of a random variable $X$ is defined by

$$F(x) = P[X \le x] \tag{A.1.1}$$

for all real $x$. The following properties are direct consequences of (A.1.1):

1. $F$ is nondecreasing, i.e., $F(x) \le F(y)$ if $x \le y$.

2. $F$ is right continuous, i.e., $F(y) \downarrow F(x)$ as $y \downarrow x$.

3. $F(x) \to 1$ and $F(y) \to 0$ as $x \to \infty$ and $y \to -\infty$, respectively.

Conversely, any function that satisfies properties 1-3 is the distribution function of
some random variable.

Most of the commonly encountered distribution functions $F$ can be expressed
either as

$$F(x) = \int_{-\infty}^{x} f(y)\,dy \tag{A.1.2}$$

or

$$F(x) = \sum_{j:\,x_j \le x} p(x_j), \tag{A.1.3}$$

where $\{x_0, x_1, x_2, \ldots\}$ is a finite or countably infinite set. In the case (A.1.2) we shall
say that the random variable $X$ is continuous. The function $f$ is called the probability
density function (pdf) of $X$ and can be found from the relation

$$f(x) = F'(x).$$

In case (A.1.3), the possible values of $X$ are restricted to the set $\{x_0, x_1, \ldots\}$, and
we shall say that the random variable $X$ is discrete. The function $p$ is called the
probability mass function (pmf) of $X$, and $F$ is constant except for upward jumps
of size $p(x_j)$ at the points $x_j$. Thus $p(x_j)$ is the size of the jump in $F$ at $x_j$, i.e.,

$$p(x_j) = F(x_j) - F(x_j^-) = P[X = x_j],$$

where $F(x_j^-) = \lim_{y \uparrow x_j} F(y)$.
Examples of Continuous Distributions


(a) The normal distribution with mean $\mu$ and variance $\sigma^2$. We say that a random
variable $X$ has the normal distribution with mean $\mu$ and variance $\sigma^2$ (written
more concisely as $X \sim N(\mu, \sigma^2)$) if $X$ has the pdf given by

$$n(x; \mu, \sigma^2) = (2\pi)^{-1/2}\sigma^{-1}e^{-(x-\mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty.$$

It follows then that $Z = (X - \mu)/\sigma \sim N(0, 1)$ and that

$$P[X \le x] = P\Big[Z \le \frac{x-\mu}{\sigma}\Big] = \Phi\Big(\frac{x-\mu}{\sigma}\Big),$$

where $\Phi(x) = \int_{-\infty}^{x}(2\pi)^{-1/2}e^{-z^2/2}\,dz$ is known as the standard normal distribution
function. The significance of the terms mean and variance for the parameters
$\mu$ and $\sigma^2$ is explained below (see Example A.1.1).


(b) The uniform distribution on $[a, b]$. The pdf of a random variable uniformly
distributed on the interval $[a, b]$ is given by

$$u(x; a, b) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise.} \end{cases}$$


(c) The exponential distribution with parameter $\lambda$. The pdf of an exponentially
distributed random variable with parameter $\lambda > 0$ is

$$e(x; \lambda) = \begin{cases} 0, & \text{if } x < 0, \\ \lambda e^{-\lambda x}, & \text{if } x \ge 0. \end{cases}$$

The corresponding distribution function is

$$F(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1 - e^{-\lambda x}, & \text{if } x \ge 0. \end{cases}$$
(d) The gamma distribution with parameters $\alpha$ and $\lambda$. The pdf of a gamma-distributed
random variable is

$$g(x; \alpha, \lambda) = \begin{cases} 0, & \text{if } x < 0, \\ x^{\alpha-1}\lambda^{\alpha}e^{-\lambda x}/\Gamma(\alpha), & \text{if } x \ge 0, \end{cases}$$

where the parameters $\alpha$ and $\lambda$ are both positive and $\Gamma$ is the gamma function
defined as

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx.$$

Note that $g$ is the exponential pdf when $\alpha = 1$ and that when $\alpha$ is a positive
integer

$$\Gamma(\alpha) = (\alpha - 1)! \quad \text{with } 0! \text{ defined to be } 1.$$


(e) The chi-squared distribution with $\nu$ degrees of freedom. For each positive integer
$\nu$, the chi-squared distribution with $\nu$ degrees of freedom is defined to be the
distribution of the sum

$$X = Z_1^2 + \cdots + Z_\nu^2,$$

where $Z_1, \ldots, Z_\nu$ are independent normally distributed random variables with
mean 0 and variance 1. This distribution is the same as the gamma distribution
with parameters $\alpha = \nu/2$ and $\lambda = 1/2$.
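The relations among Examples (c), (d), and (e) can be checked numerically. The sketch below uses hypothetical function names and Python's standard `math.gamma`; it evaluates the gamma pdf and obtains the exponential and chi-squared densities as special cases.

```python
import math

def gamma_pdf(x, alpha, lam):
    """g(x; alpha, lam) of Example (d). For x = 0 and alpha < 1 the density
    is unbounded, and this sketch raises ZeroDivisionError there."""
    if x < 0:
        return 0.0
    return x ** (alpha - 1) * lam ** alpha * math.exp(-lam * x) / math.gamma(alpha)

def exponential_pdf(x, lam):
    """e(x; lam) of Example (c)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def chi2_pdf(x, nu):
    """Chi-squared pdf as the gamma pdf with alpha = nu/2 and lam = 1/2."""
    return gamma_pdf(x, nu / 2, 0.5)

# alpha = 1 recovers the exponential pdf; Gamma(5) = 4! = 24.
same = abs(gamma_pdf(1.3, 1, 2.0) - exponential_pdf(1.3, 2.0)) < 1e-12
factorial_ok = abs(math.gamma(5) - math.factorial(4)) < 1e-9
```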


Examples of Discrete Distributions


(f) The binomial distribution with parameters $n$ and $p$. The pmf of a binomially
distributed random variable $X$ with parameters $n$ and $p$ is

$$b(j; n, p) = P[X = j] = \binom{n}{j}p^j(1-p)^{n-j}, \quad j = 0, 1, \ldots, n,$$

where $n$ is a positive integer and $0 \le p \le 1$.


(g) The uniform distribution on $\{1, 2, \ldots, k\}$. The pmf of a random variable $X$
uniformly distributed on $\{1, 2, \ldots, k\}$ is

$$p(j) = P[X = j] = \frac{1}{k}, \quad j = 1, 2, \ldots, k,$$

where $k$ is a positive integer.


(h) The Poisson distribution with parameter $\lambda$. A random variable $X$ is said to have
a Poisson distribution with parameter $\lambda > 0$ if

$$p(j; \lambda) = P[X = j] = \frac{\lambda^j}{j!}e^{-\lambda}, \quad j = 0, 1, \ldots.$$

We shall see in Example A.1.2 below that $\lambda$ is the mean of $X$.


(i) The negative binomial distribution with parameters $\alpha$ and $p$. The random variable
$X$ is said to have a negative binomial distribution with parameters $\alpha > 0$ and
$p \in [0, 1]$ if it has pmf

$$nb(j; \alpha, p) = \Big(\prod_{k=1}^{j}\frac{k-1+\alpha}{k}\Big)(1-p)^j p^{\alpha}, \quad j = 0, 1, \ldots,$$

where the product is defined to be 1 if $j = 0$.


Not all random variables can be neatly categorized as either continuous or discrete.
For example, consider the time you spend waiting to be served at a checkout
counter and suppose that the probability of finding no customers ahead of you is $\frac{1}{2}$.
Then the time you spend waiting for service can be expressed as

$$W = \begin{cases} 0, & \text{with probability } \frac{1}{2}, \\ W_1, & \text{with probability } \frac{1}{2}, \end{cases}$$

where $W_1$ is a continuous random variable. If the distribution of $W_1$ is exponential
with parameter 1, then the distribution function of $W$ is

$$F(x) = \begin{cases} 0, & \text{if } x < 0, \\ \frac{1}{2} + \frac{1}{2}\big(1 - e^{-x}\big) = 1 - \frac{1}{2}e^{-x}, & \text{if } x \ge 0. \end{cases}$$

This distribution function is neither continuous (since it has a discontinuity at $x = 0$)
nor discrete (since it increases continuously for $x > 0$). It is expressible as a mixture,

$$F = pF_d + (1-p)F_c,$$

with $p = \frac{1}{2}$, of a discrete distribution function

$$F_d(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0, \end{cases}$$

and a continuous distribution function

$$F_c(x) = \begin{cases} 0, & x < 0, \\ 1 - e^{-x}, & x \ge 0. \end{cases}$$

Every distribution function can in fact be expressed in the form

$$F = p_1F_d + p_2F_c + p_3F_{sc},$$

where $0 \le p_1, p_2, p_3 \le 1$, $p_1 + p_2 + p_3 = 1$, $F_d$ is discrete, $F_c$ is continuous, and $F_{sc}$
is singular continuous (continuous but not of the form (A.1.2)). Distribution functions
with a singular continuous component are rarely encountered.
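The checkout-counter mixture can be written out directly. The following short sketch (function names are illustrative) builds the distribution function of $W$ from its discrete and continuous components and confirms the jump of size $\frac{1}{2}$ at $x = 0$.

```python
import math

def F_d(x):
    """Discrete component: a point mass at 0."""
    return 1.0 if x >= 0 else 0.0

def F_c(x):
    """Continuous component: exponential with parameter 1."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def F_W(x, p=0.5):
    """Mixture F = p * F_d + (1 - p) * F_c."""
    return p * F_d(x) + (1 - p) * F_c(x)

# Jump at 0: F(0) - F(0-) equals the probability of no wait.
jump_at_zero = F_W(0.0) - F_W(-1e-12)
```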
Expectation, Mean, and Variance


The expectation of a function $g$ of a random variable $X$ is defined by

$$E(g(X)) = \int g(x)\,dF(x),$$

where

$$\int g(x)\,dF(x) := \begin{cases} \displaystyle\int_{-\infty}^{\infty} g(x)f(x)\,dx & \text{in the continuous case,} \\ \displaystyle\sum_{j=0}^{\infty} g(x_j)p(x_j) & \text{in the discrete case,} \end{cases}$$

and $g$ is any function such that $E|g(X)| < \infty$. (If $F$ is the mixture $F = pF_c + (1-p)F_d$,
then $E(g(X)) = p\int g(x)\,dF_c(x) + (1-p)\int g(x)\,dF_d(x)$.) The mean and variance
of $X$ are defined as $\mu = EX$ and $\sigma^2 = E(X - \mu)^2$, respectively. They are evaluated
by setting $g(x) = x$ and $g(x) = (x - \mu)^2$ in the definition of $E(g(X))$.

It is clear from the definition that expectation has the linearity property

$$E(aX + b) = aE(X) + b$$

for any real constants $a$ and $b$ (provided that $E|X| < \infty$).


Example A.1.1 The normal distribution

If $X$ has the normal distribution with pdf $n(x; \mu, \sigma^2)$ as defined in Example (a) above,
then

$$E(X - \mu) = \int_{-\infty}^{\infty}(x - \mu)\,n(x; \mu, \sigma^2)\,dx = -\sigma^2\int_{-\infty}^{\infty} n'(x; \mu, \sigma^2)\,dx = 0.$$

This shows, with the help of the linearity property of $E$, that

$$E(X) = \mu,$$

i.e., that the parameter $\mu$ is in fact the mean of the normal distribution defined in
Example (a). Similarly,

$$E(X - \mu)^2 = \int_{-\infty}^{\infty}(x - \mu)^2\,n(x; \mu, \sigma^2)\,dx = -\sigma^2\int_{-\infty}^{\infty}(x - \mu)\,n'(x; \mu, \sigma^2)\,dx.$$

Integrating by parts and using the fact that $n(\cdot\,; \mu, \sigma^2)$ is a pdf, we find that the
variance of $X$ is

$$E(X - \mu)^2 = \sigma^2\int_{-\infty}^{\infty} n(x; \mu, \sigma^2)\,dx = \sigma^2.$$
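Example A.1.2 The Poisson distribution


The mean of the Poisson distribution with parameter $\lambda$ (see Example (h) above) is
given by

$$\mu = \sum_{j=0}^{\infty} j\,\frac{\lambda^j}{j!}e^{-\lambda} = \lambda e^{-\lambda}\sum_{j=1}^{\infty}\frac{\lambda^{j-1}}{(j-1)!} = \lambda e^{-\lambda}e^{\lambda} = \lambda.$$

A similar calculation shows that the variance is also equal to $\lambda$ (see Problem A.2).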
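As a numerical illustration of Example A.1.2, the mean and variance of a Poisson distribution can be approximated by truncating the defining sums. This is a sketch under stated assumptions: the function names and the truncation point of 100 terms are arbitrary choices, adequate only for moderate values of $\lambda$.

```python
import math

def poisson_pmf(j, lam):
    """p(j; lam) = lam^j * exp(-lam) / j! from Example (h)."""
    return lam ** j * math.exp(-lam) / math.factorial(j)

def poisson_moments(lam, terms=100):
    """Approximate mean and variance by summing the first `terms` terms."""
    mean = sum(j * poisson_pmf(j, lam) for j in range(terms))
    second = sum(j * j * poisson_pmf(j, lam) for j in range(terms))
    return mean, second - mean ** 2

mean, variance = poisson_moments(3.5)  # both should be close to 3.5
```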


Remark. Functions and parameters associated with a random variable $X$ will be
labeled with the subscript $X$ whenever it is necessary to identify the particular random
variable to which they refer. For example, the distribution function, pdf, mean, and
variance of $X$ will be written as $F_X$, $f_X$, $\mu_X$, and $\sigma_X^2$, respectively, whenever it is
necessary to distinguish them from the corresponding quantities $F_Y$, $f_Y$, $\mu_Y$, and $\sigma_Y^2$
associated with a different random variable $Y$.


A.2 Random Vectors 


An $n$-dimensional random vector is a column vector $\mathbf{X} = (X_1, \ldots, X_n)'$ each of whose
components is a random variable. The distribution function $F$ of $\mathbf{X}$, also called the
joint distribution of $X_1, \ldots, X_n$, is defined by

$$F(x_1, \ldots, x_n) = P[X_1 \le x_1, \ldots, X_n \le x_n] \tag{A.2.1}$$

for all real numbers $x_1, \ldots, x_n$. This can be expressed in a more compact form as

$$F(\mathbf{x}) = P[\mathbf{X} \le \mathbf{x}], \quad \mathbf{x} = (x_1, \ldots, x_n)',$$

for all real vectors $\mathbf{x} = (x_1, \ldots, x_n)'$. The joint distribution of any subcollection
$X_{i_1}, \ldots, X_{i_k}$ of these random variables can be obtained from $F$ by setting $x_j = \infty$ in
(A.2.1) for all $j \notin \{i_1, \ldots, i_k\}$. In particular, the distributions of $X_1$ and $(X_1, X_n)'$ are
given by

$$F_{X_1}(x_1) = P[X_1 \le x_1] = F(x_1, \infty, \ldots, \infty)$$

and

$$F_{X_1, X_n}(x_1, x_n) = P[X_1 \le x_1, X_n \le x_n] = F(x_1, \infty, \ldots, \infty, x_n).$$

As in the univariate case, a random vector with distribution function $F$ is said to be
continuous if $F$ has a density function, i.e., if

$$F(x_1, \ldots, x_n) = \int_{-\infty}^{x_n}\cdots\int_{-\infty}^{x_2}\int_{-\infty}^{x_1} f(y_1, \ldots, y_n)\,dy_1\,dy_2\cdots dy_n.$$

The probability density of $\mathbf{X}$ is then found from

$$f(x_1, \ldots, x_n) = \frac{\partial^n F(x_1, \ldots, x_n)}{\partial x_1\cdots\partial x_n}.$$

The random vector $\mathbf{X}$ is said to be discrete if there exist real-valued vectors $\mathbf{x}_0, \mathbf{x}_1, \ldots$
and a probability mass function $p(\mathbf{x}_j) = P[\mathbf{X} = \mathbf{x}_j]$ such that

$$\sum_{j=0}^{\infty} p(\mathbf{x}_j) = 1.$$

The expectation of a function $g$ of a random vector $\mathbf{X}$ is defined by

$$E(g(\mathbf{X})) = \int g(\mathbf{x})\,dF(\mathbf{x}) = \int g(x_1, \ldots, x_n)\,dF(x_1, \ldots, x_n),$$

where

$$\int g(x_1, \ldots, x_n)\,dF(x_1, \ldots, x_n) = \begin{cases} \displaystyle\int\cdots\int g(x_1, \ldots, x_n)f(x_1, \ldots, x_n)\,dx_1\cdots dx_n, & \text{in the continuous case,} \\ \displaystyle\sum_{j_1}\cdots\sum_{j_n} g(x_{j_1}, \ldots, x_{j_n})\,p(x_{j_1}, \ldots, x_{j_n}), & \text{in the discrete case,} \end{cases}$$

and $g$ is any function such that $E|g(\mathbf{X})| < \infty$.

The random variables $X_1, \ldots, X_n$ are said to be independent if

$$P[X_1 \le x_1, \ldots, X_n \le x_n] = P[X_1 \le x_1]\cdots P[X_n \le x_n],$$

i.e.,

$$F(x_1, \ldots, x_n) = F_{X_1}(x_1)\cdots F_{X_n}(x_n)$$

for all real numbers $x_1, \ldots, x_n$. In the continuous and discrete cases, independence
is equivalent to the factorization of the joint density function or probability mass
function into the product of the respective marginal densities or mass functions, i.e.,

$$f(x_1, \ldots, x_n) = f_{X_1}(x_1)\cdots f_{X_n}(x_n) \tag{A.2.2}$$

or

$$p(x_1, \ldots, x_n) = p_{X_1}(x_1)\cdots p_{X_n}(x_n). \tag{A.2.3}$$
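For two random vectors $\mathbf{X} = (X_1, \ldots, X_n)'$ and $\mathbf{Y} = (Y_1, \ldots, Y_m)'$ with joint
density function $f_{\mathbf{X},\mathbf{Y}}$, the conditional density of $\mathbf{Y}$ given $\mathbf{X} = \mathbf{x}$ is

$$f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = \begin{cases} \dfrac{f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})}{f_{\mathbf{X}}(\mathbf{x})}, & \text{if } f_{\mathbf{X}}(\mathbf{x}) > 0, \\ f_{\mathbf{Y}}(\mathbf{y}), & \text{if } f_{\mathbf{X}}(\mathbf{x}) = 0. \end{cases}$$

The conditional expectation of $g(\mathbf{Y})$ given $\mathbf{X} = \mathbf{x}$ is then

$$E(g(\mathbf{Y})|\mathbf{X} = \mathbf{x}) = \int_{-\infty}^{\infty} g(\mathbf{y})\,f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})\,d\mathbf{y}.$$

If $\mathbf{X}$ and $\mathbf{Y}$ are independent, then $f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = f_{\mathbf{Y}}(\mathbf{y})$ by (A.2.2), and so the conditional
expectation of $g(\mathbf{Y})$ given $\mathbf{X} = \mathbf{x}$ is

$$E(g(\mathbf{Y})|\mathbf{X} = \mathbf{x}) = E(g(\mathbf{Y})),$$

which, as expected, does not depend on $\mathbf{x}$. The same ideas hold in the discrete case
with the probability mass function assuming the role of the density function.


Means and Covariances


If $E|X_i| < \infty$ for each $i$, then we define the mean or expected value of
$\mathbf{X} = (X_1, \ldots, X_n)'$ to be the column vector

$$\boldsymbol{\mu}_{\mathbf{X}} = E\mathbf{X} = (EX_1, \ldots, EX_n)'.$$

In the same way we define the expected value of any array whose elements are random
variables (e.g., a matrix of random variables) to be the same array with each random
variable replaced by its expected value (if the expectation exists).

If $\mathbf{X} = (X_1, \ldots, X_n)'$ and $\mathbf{Y} = (Y_1, \ldots, Y_m)'$ are random vectors such that each
$X_i$ and $Y_j$ has a finite variance, then the covariance matrix of $\mathbf{X}$ and $\mathbf{Y}$ is defined to
be the matrix

$$\Sigma_{\mathbf{XY}} = \mathrm{Cov}(\mathbf{X}, \mathbf{Y}) = E[(\mathbf{X} - E\mathbf{X})(\mathbf{Y} - E\mathbf{Y})'] = E(\mathbf{X}\mathbf{Y}') - (E\mathbf{X})(E\mathbf{Y})'.$$

The $(i, j)$ element of $\Sigma_{\mathbf{XY}}$ is the covariance $\mathrm{Cov}(X_i, Y_j) = E(X_iY_j) - E(X_i)E(Y_j)$.
In the special case where $\mathbf{Y} = \mathbf{X}$, $\mathrm{Cov}(\mathbf{X}, \mathbf{Y})$ reduces to the covariance matrix of the
random vector $\mathbf{X}$.

Now suppose that $\mathbf{Y}$ and $\mathbf{X}$ are linearly related through the equation

$$\mathbf{Y} = \mathbf{a} + B\mathbf{X},$$

where $\mathbf{a}$ is an $m$-dimensional column vector and $B$ is an $m \times n$ matrix. Then $\mathbf{Y}$ has
mean

$$E\mathbf{Y} = \mathbf{a} + BE\mathbf{X} \tag{A.2.4}$$

and covariance matrix

$$\Sigma_{\mathbf{YY}} = B\Sigma_{\mathbf{XX}}B' \tag{A.2.5}$$

(see Problem A.3).


Proposition A.2.1 The covariance matrix $\Sigma_{\mathbf{XX}}$ of a random vector $\mathbf{X}$ is symmetric and nonnegative
definite, i.e., $\mathbf{b}'\Sigma_{\mathbf{XX}}\mathbf{b} \ge 0$ for all vectors $\mathbf{b} = (b_1, \ldots, b_n)'$ with real components.

Proof Since the $(i, j)$ element of $\Sigma_{\mathbf{XX}}$ is $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$, it is clear that $\Sigma_{\mathbf{XX}}$ is
symmetric. To prove nonnegative definiteness, let $\mathbf{b} = (b_1, \ldots, b_n)'$ be an arbitrary
vector. Then applying (A.2.5) with $\mathbf{a} = \mathbf{0}$ and $B = \mathbf{b}'$, we have

$$\mathbf{b}'\Sigma_{\mathbf{XX}}\mathbf{b} = \mathrm{Var}(\mathbf{b}'\mathbf{X}) = \mathrm{Var}(b_1X_1 + \cdots + b_nX_n) \ge 0. \qquad \blacksquare$$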
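The relations (A.2.4) and (A.2.5), and the identity $\mathbf{b}'\Sigma_{\mathbf{XX}}\mathbf{b} = \mathrm{Var}(\mathbf{b}'\mathbf{X})$ used in the proof, can be illustrated with plain list-based matrix arithmetic. The helper names and the numerical values below are illustrative assumptions, not taken from the text.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def transformed_moments(a, B, mean_x, cov_x):
    """Mean and covariance of Y = a + B X via (A.2.4) and (A.2.5)."""
    mean_y = [ai + sum(bij * mj for bij, mj in zip(row, mean_x))
              for ai, row in zip(a, B)]
    cov_y = matmul(matmul(B, cov_x), transpose(B))
    return mean_y, cov_y

mean_x = [1.0, 2.0]
cov_x = [[4.0, 1.0], [1.0, 9.0]]
# Taking B to be the row vector b' = (1, -1) gives Var(X1 - X2) = b' Sigma b.
mean_y, cov_y = transformed_moments([0.0], [[1.0, -1.0]], mean_x, cov_x)
```

Here $\mathrm{Var}(X_1 - X_2) = 4 + 9 - 2(1) = 11$, which is nonnegative as Proposition A.2.1 requires.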


Proposition A.2.2 Every $n \times n$ covariance matrix $\Sigma$ can be factorized as

$$\Sigma = P\Lambda P',$$

where $P$ is an orthogonal matrix (i.e., $P' = P^{-1}$) whose columns are an orthonormal
set of right eigenvectors corresponding to the (nonnegative) eigenvalues $\lambda_1, \ldots, \lambda_n$
of $\Sigma$, and $\Lambda$ is the diagonal matrix

$$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}.$$

In particular, $\Sigma$ is nonsingular if and only if all the eigenvalues are strictly positive.


Proof Every covariance matrix is symmetric and nonnegative definite by Proposition A.2.1,
and for such matrices the specified factorization is a standard result (see Graybill,
1983 for a proof). The determinant of an orthogonal matrix is $\pm 1$, so that
$\det(\Sigma) = \det(P)\det(\Lambda)\det(P') = \lambda_1\cdots\lambda_n$. It follows that $\Sigma$ is nonsingular if and only if $\lambda_i > 0$
for all $i$. $\blacksquare$


Remark. Given a covariance matrix $\Sigma$, it is sometimes useful to be able to find a
square root $A = \Sigma^{1/2}$ with the property that $AA' = \Sigma$. It is clear from Proposition
A.2.2 and the orthogonality of $P$ that one such matrix is given by

$$A = \Sigma^{1/2} = P\Lambda^{1/2}P'.$$

If $\Sigma$ is nonsingular, then we can define

$$\Sigma^{s} = P\Lambda^{s}P', \quad -\infty < s < \infty.$$

The matrix $\Sigma^{-1/2}$ defined in this way is then a square root of $\Sigma^{-1}$ and also the inverse
of $\Sigma^{1/2}$.


A.3 The Multivariate Normal Distribution 


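The multivariate normal distribution is one of the most commonly encountered and
important distributions in statistics. It plays a key role in the modeling of time series
data. Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be a random vector.

Definition A.3.1

$\mathbf{X}$ has a multivariate normal distribution with mean $\boldsymbol{\mu}$ and nonsingular covariance
matrix $\Sigma = \Sigma_{\mathbf{XX}}$, written as $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$, if

$$f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-n/2}(\det\Sigma)^{-1/2}\exp\Big\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\Big\}.$$

If $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$, we can define a standardized random vector $\mathbf{Z}$ by applying the
linear transformation

$$\mathbf{Z} = \Sigma^{-1/2}(\mathbf{X} - \boldsymbol{\mu}), \tag{A.3.1}$$

where $\Sigma^{-1/2}$ is defined as in the remark of Section A.2. Then by (A.2.4) and (A.2.5),
$\mathbf{Z}$ has mean $\mathbf{0}$ and covariance matrix $\Sigma_{\mathbf{ZZ}} = \Sigma^{-1/2}\Sigma\Sigma^{-1/2} = I_n$, where $I_n$ is the $n \times n$
identity matrix. Using the change of variables formula for probability densities (see
Mood, Graybill, and Boes, 1974), we find that the probability density of $\mathbf{Z}$ is

$$\begin{aligned} f_{\mathbf{Z}}(\mathbf{z}) &= (\det\Sigma)^{1/2}f_{\mathbf{X}}\big(\Sigma^{1/2}\mathbf{z} + \boldsymbol{\mu}\big) \\ &= (\det\Sigma)^{1/2}(2\pi)^{-n/2}(\det\Sigma)^{-1/2}\exp\Big\{-\frac{1}{2}\mathbf{z}'\Sigma^{1/2}\Sigma^{-1}\Sigma^{1/2}\mathbf{z}\Big\} \\ &= (2\pi)^{-n/2}\exp\Big\{-\frac{1}{2}\mathbf{z}'\mathbf{z}\Big\} \\ &= \Big((2\pi)^{-1/2}\exp\Big\{-\frac{1}{2}z_1^2\Big\}\Big)\cdots\Big((2\pi)^{-1/2}\exp\Big\{-\frac{1}{2}z_n^2\Big\}\Big), \end{aligned}$$

showing, by (A.2.2), that $Z_1, \ldots, Z_n$ are independent N(0, 1) random variables. Thus
the standardized random vector $\mathbf{Z}$ defined by (A.3.1) has independent standard normal
random components. Conversely, given any $n \times 1$ mean vector $\boldsymbol{\mu}$, a nonsingular $n \times n$
covariance matrix $\Sigma$, and an $n \times 1$ vector $\mathbf{Z}$ of independent standard normal random
variables, we can construct a normally distributed random vector with mean $\boldsymbol{\mu}$ and
covariance matrix $\Sigma$ by defining

$$\mathbf{X} = \Sigma^{1/2}\mathbf{Z} + \boldsymbol{\mu}. \tag{A.3.2}$$

(See Problem A.4.)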
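The construction (A.3.2) is the basis of a simple simulation recipe. The sketch below is for the bivariate case only and makes one substitution that should be stated plainly: instead of the symmetric square root $\Sigma^{1/2} = P\Lambda^{1/2}P'$, it uses the lower-triangular Cholesky factor $A$, which also satisfies $AA' = \Sigma$ and therefore, by (A.2.4), (A.2.5), and Remark 2 below, yields a vector with the same normal distribution. The function names and numerical values are hypothetical.

```python
import math
import random

def cholesky_2x2(cov):
    """Lower-triangular A with A A' = cov, for a nonsingular 2 x 2 covariance matrix."""
    (s11, s12), (_, s22) = cov
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]

def sample_bivariate_normal(mean, cov, rng):
    """One draw of X = A Z + mu with Z a pair of independent N(0, 1) variables."""
    A = cholesky_2x2(cov)
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [mean[i] + sum(A[i][k] * z[k] for k in range(2)) for i in range(2)]

rng = random.Random(0)  # fixed seed so that the draw is reproducible
draw = sample_bivariate_normal([1.0, -2.0], [[4.0, 1.2], [1.2, 1.0]], rng)
```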


Remark 1. The multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix
$\Sigma$ can be defined, even when $\Sigma$ is singular, as the distribution of the vector $\mathbf{X}$ in (A.3.2).
The singular multivariate normal distribution does not have a joint density, since
the possible values of $\mathbf{X} - \boldsymbol{\mu}$ are constrained to lie in a subspace of $\mathbb{R}^n$ with dimension
equal to $\mathrm{rank}(\Sigma)$.

Remark 2. If $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$, $B$ is an $m \times n$ matrix, and $\mathbf{a}$ is a real $m \times 1$ vector, then
the random vector

$$\mathbf{Y} = \mathbf{a} + B\mathbf{X}$$

is also multivariate normal (see Problem A.5). Note that from (A.2.4) and (A.2.5), $\mathbf{Y}$
has mean $\mathbf{a} + B\boldsymbol{\mu}$ and covariance matrix $B\Sigma B'$. In particular, by taking $B$ to be the
row vector $\mathbf{b}' = (b_1, \ldots, b_n)$, we see that any linear combination of the components
of a multivariate normal random vector is normal. Thus $\mathbf{b}'\mathbf{X} = b_1X_1 + \cdots + b_nX_n \sim
N(\mathbf{b}'\boldsymbol{\mu}_{\mathbf{X}}, \mathbf{b}'\Sigma_{\mathbf{XX}}\mathbf{b})$.


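Example A.3.1 The bivariate normal distribution


Suppose that $\mathbf{X} = (X_1, X_2)'$ is a bivariate normal random vector with mean
$\boldsymbol{\mu} = (\mu_1, \mu_2)'$ and covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}, \quad \sigma_1 > 0,\ \sigma_2 > 0,\ -1 < \rho < 1. \tag{A.3.3}$$

The parameters $\sigma_1$, $\sigma_2$, and $\rho$ are the standard deviations and correlation of the
components $X_1$ and $X_2$. Every nonsingular 2-dimensional covariance matrix can be
expressed in the form (A.3.3). The inverse of $\Sigma$ is

$$\Sigma^{-1} = (1 - \rho^2)^{-1}\begin{bmatrix} \sigma_1^{-2} & -\rho\sigma_1^{-1}\sigma_2^{-1} \\ -\rho\sigma_1^{-1}\sigma_2^{-1} & \sigma_2^{-2} \end{bmatrix},$$

and so the pdf of $\mathbf{X}$ is given by

$$f_{\mathbf{X}}(\mathbf{x}) = \big(2\pi\sigma_1\sigma_2(1 - \rho^2)^{1/2}\big)^{-1}\exp\Big\{\frac{-1}{2(1 - \rho^2)}\Big[\Big(\frac{x_1 - \mu_1}{\sigma_1}\Big)^2 - 2\rho\Big(\frac{x_1 - \mu_1}{\sigma_1}\Big)\Big(\frac{x_2 - \mu_2}{\sigma_2}\Big) + \Big(\frac{x_2 - \mu_2}{\sigma_2}\Big)^2\Big]\Big\}.$$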


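Multivariate normal random vectors have the important property that the conditional
distribution of any set of components, given any other set, is again multivariate
normal. In the following proposition we shall suppose that the nonsingular normal
random vector $\mathbf{X}$ is partitioned into two subvectors

$$\mathbf{X} = \begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \end{bmatrix}.$$

Correspondingly, we shall write the mean and covariance matrix of $\mathbf{X}$ as

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}^{(1)} \\ \boldsymbol{\mu}^{(2)} \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$

where $\boldsymbol{\mu}^{(i)} = E\mathbf{X}^{(i)}$ and $\Sigma_{ij} = E\big(\mathbf{X}^{(i)} - \boldsymbol{\mu}^{(i)}\big)\big(\mathbf{X}^{(j)} - \boldsymbol{\mu}^{(j)}\big)'$.


Proposition A.3.1

i. $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are independent if and only if $\Sigma_{12} = 0$.

ii. The conditional distribution of $\mathbf{X}^{(1)}$ given $\mathbf{X}^{(2)} = \mathbf{x}^{(2)}$ is
$N\big(\boldsymbol{\mu}^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}^{(2)} - \boldsymbol{\mu}^{(2)}),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)$. In particular,

$$E\big(\mathbf{X}^{(1)}|\mathbf{X}^{(2)} = \mathbf{x}^{(2)}\big) = \boldsymbol{\mu}^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\big(\mathbf{x}^{(2)} - \boldsymbol{\mu}^{(2)}\big).$$

The proof of this proposition involves routine algebraic manipulations of the
multivariate normal density function and is left as an exercise (see Problem A.6).


Example A.3.2

For the bivariate normal random vector $\mathbf{X}$ in Example A.3.1, we immediately deduce
from Proposition A.3.1 that $X_1$ and $X_2$ are independent if and only if $\rho\sigma_1\sigma_2 = 0$ (or
$\rho = 0$, since $\sigma_1$ and $\sigma_2$ are both positive). The conditional distribution of $X_1$ given
$X_2 = x_2$ is normal with mean

$$E(X_1|X_2 = x_2) = \mu_1 + \rho\sigma_1\sigma_2^{-1}(x_2 - \mu_2)$$

and variance

$$\mathrm{Var}(X_1|X_2 = x_2) = \sigma_1^2(1 - \rho^2).$$


Definition A.3.2

$\{X_t\}$ is a Gaussian time series if all of its joint distributions are multivariate
normal, i.e., if for any collection of integers $i_1, \ldots, i_n$, the random vector
$(X_{i_1}, \ldots, X_{i_n})'$ has a multivariate normal distribution.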


Remark 3. If $\{X_t\}$ is a Gaussian time series, then all of its joint distributions are
completely determined by the mean function $\mu(t) = EX_t$ and the autocovariance
function $\kappa(s, t) = \mathrm{Cov}(X_s, X_t)$. If the process also happens to be stationary, then
the mean function is constant ($\mu(t) = \mu$ for all $t$) and $\kappa(t + h, t) = \gamma(h)$ for all $t$. In
this case, the joint distribution of $X_1, \ldots, X_n$ is the same as that of $X_{1+h}, \ldots, X_{n+h}$
for all integers $h$ and $n > 0$. Hence for a Gaussian time series strict stationarity is
equivalent to weak stationarity (see Section 2.1).
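Returning to the conditional distribution in Example A.3.2, the two formulas translate directly into code. The function name and the numerical values are illustrative only; the formulas are those of Proposition A.3.1 with $\Sigma_{12} = \rho\sigma_1\sigma_2$ and $\Sigma_{22} = \sigma_2^2$.

```python
def conditional_normal(mu1, mu2, sigma1, sigma2, rho, x2):
    """Mean and variance of X1 given X2 = x2 for the bivariate normal
    vector of Example A.3.1."""
    mean = mu1 + rho * sigma1 / sigma2 * (x2 - mu2)  # mu1 + Sigma12 Sigma22^{-1} (x2 - mu2)
    var = sigma1 ** 2 * (1 - rho ** 2)               # Sigma11 - Sigma12 Sigma22^{-1} Sigma21
    return mean, var

# Illustrative values: observing x2 one unit above its mean shifts the
# conditional mean of X1 by rho * sigma1 / sigma2 = 1.
mean, var = conditional_normal(mu1=0.0, mu2=0.0, sigma1=2.0, sigma2=1.0, rho=0.5, x2=1.0)
```

Note that the conditional variance does not depend on the observed value $x_2$, and that it is strictly smaller than $\sigma_1^2$ whenever $\rho \ne 0$.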
Problems


A.1. Let $X$ have a negative binomial distribution with parameters $\alpha$ and $p$, where
$\alpha > 0$ and $0 < p < 1$.

a. Show that the probability generating function of $X$ (defined as $M(s) = E(s^X)$) is

$$M(s) = p^{\alpha}(1 - s + sp)^{-\alpha}, \quad 0 \le s \le 1.$$

b. Using the property that $M'(1) = E(X)$ and $M''(1) = E(X^2) - E(X)$, show
that

$$E(X) = \alpha(1 - p)/p \quad \text{and} \quad \mathrm{Var}(X) = \alpha(1 - p)/p^2.$$

A.2. If $X$ has the Poisson distribution with mean $\lambda$, show that the variance of $X$ is
also $\lambda$.

A.3. Use the linearity of the expectation operator for real-valued random variables
to establish (A.2.4) and (A.2.5).

A.4. If $\Sigma$ is an $n \times n$ covariance matrix, $\Sigma^{1/2}$ is the square root of $\Sigma$ defined in the
remark of Section A.2, and $\mathbf{Z}$ is an $n$-vector whose components are independent
normal random variables with mean 0 and variance 1, show that

$$\mathbf{X} = \Sigma^{1/2}\mathbf{Z} + \boldsymbol{\mu}$$

is a normally distributed random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$.

A.5. Show that if $\mathbf{X}$ is an $n$-dimensional random vector such that $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$, $B$
is a real $m \times n$ matrix, and $\mathbf{a}$ is a real-valued $m$-vector, then

$$\mathbf{Y} = \mathbf{a} + B\mathbf{X}$$

is a multivariate normal random vector. Specify the mean and covariance matrix
of $\mathbf{Y}$.

A.6. Prove Proposition A.3.1.

A.7. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)' \sim N(\boldsymbol{\mu}, \Sigma)$ with $\Sigma$ nonsingular. Using the
fact that $\mathbf{Z}$, as defined in (A.3.1), has independent standard normal components,
show that $(\mathbf{X} - \boldsymbol{\mu})'\Sigma^{-1}(\mathbf{X} - \boldsymbol{\mu})$ has the chi-squared distribution with $n$ degrees
of freedom (Section A.1, Example (e)).

A.8. Suppose that $\mathbf{X} = (X_1, \ldots, X_n)' \sim N(\boldsymbol{\mu}, \Sigma)$ with $\Sigma$ nonsingular. If $A$ is a
symmetric $n \times n$ matrix, show that $E(\mathbf{X}'A\mathbf{X}) = \mathrm{trace}(A\Sigma) + \boldsymbol{\mu}'A\boldsymbol{\mu}$.

A.9. Suppose that $\{X_t\}$ is a stationary Gaussian time series with mean 0 and
autocovariance function $\gamma(h)$. Find $E(X_t|X_s)$ and $\mathrm{Var}(X_t|X_s)$, $s \ne t$.


Statistical Complements 


B.1 Least Squares Estimation 

B.2 Maximum Likelihood Estimation 
B.3 Confidence Intervals 

B.4 Hypothesis Testing 


B.1 Least Squares Estimation 


Consider the problem of finding the “best” straight line

$$y = \theta_0 + \theta_1 x$$

to approximate observations $y_1, \ldots, y_n$ of a dependent variable $y$ taken at fixed values
$x_1, \ldots, x_n$ of the independent variable $x$. The (ordinary) least squares estimates $\hat\theta_0$,
$\hat\theta_1$ are defined to be values of $\theta_0$, $\theta_1$ that minimize the sum

$$S(\theta_0, \theta_1) = \sum_{i=1}^{n}(y_i - \theta_0 - \theta_1 x_i)^2$$

of squared deviations of the observations $y_i$ from the fitted values $\theta_0 + \theta_1 x_i$. (The
“sum of squares” $S(\theta_0, \theta_1)$ is identical to the Euclidean squared distance between $\mathbf{y}$
and $\theta_0\mathbf{1} + \theta_1\mathbf{x}$, i.e.,

$$S(\theta_0, \theta_1) = \|\mathbf{y} - \theta_0\mathbf{1} - \theta_1\mathbf{x}\|^2,$$

where $\mathbf{x} = (x_1, \ldots, x_n)'$, $\mathbf{1} = (1, \ldots, 1)'$, and $\mathbf{y} = (y_1, \ldots, y_n)'$.) Setting the partial
derivatives of $S$ with respect to $\theta_0$ and $\theta_1$ both equal to zero shows that the vector
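$\hat{\boldsymbol{\theta}} = (\hat\theta_0, \hat\theta_1)'$ satisfies the “normal equations”

$$X'X\hat{\boldsymbol{\theta}} = X'\mathbf{y},$$

where $X$ is the $n \times 2$ matrix $X = [\mathbf{1}, \mathbf{x}]$. Since $0 \le S(\boldsymbol{\theta})$ and $S(\boldsymbol{\theta}) \to \infty$ as $\|\boldsymbol{\theta}\| \to \infty$,
the normal equations have at least one solution. If $\hat{\boldsymbol{\theta}}^{(1)}$ and $\hat{\boldsymbol{\theta}}^{(2)}$ are two solutions of
the normal equations, then a simple calculation shows that

$$\big(\hat{\boldsymbol{\theta}}^{(1)} - \hat{\boldsymbol{\theta}}^{(2)}\big)'X'X\big(\hat{\boldsymbol{\theta}}^{(1)} - \hat{\boldsymbol{\theta}}^{(2)}\big) = 0,$$

i.e., that $X\hat{\boldsymbol{\theta}}^{(1)} = X\hat{\boldsymbol{\theta}}^{(2)}$. The solution of the normal equations is unique if and only if
the matrix $X'X$ is nonsingular. But the preceding calculations show that even if $X'X$
is singular, the vector $\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}}$ of fitted values is the same for any solution $\hat{\boldsymbol{\theta}}$ of the
normal equations.

The argument just given applies equally well to least squares estimation for the
general linear model. Given a set of data points

$$(x_{i1}, x_{i2}, \ldots, x_{im}, y_i), \quad i = 1, \ldots, n \text{ with } m \le n,$$

the least squares estimate $\hat{\boldsymbol{\theta}} = (\hat\theta_1, \ldots, \hat\theta_m)'$ of $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_m)'$ minimizes

$$S(\boldsymbol{\theta}) = \sum_{i=1}^{n}(y_i - \theta_1 x_{i1} - \cdots - \theta_m x_{im})^2 = \big\|\mathbf{y} - \theta_1\mathbf{x}^{(1)} - \cdots - \theta_m\mathbf{x}^{(m)}\big\|^2,$$

where $\mathbf{y} = (y_1, \ldots, y_n)'$ and $\mathbf{x}^{(j)} = (x_{1j}, \ldots, x_{nj})'$, $j = 1, \ldots, m$. As in the previous
special case, $\hat{\boldsymbol{\theta}}$ satisfies the equations

$$X'X\hat{\boldsymbol{\theta}} = X'\mathbf{y},$$

where $X$ is the $n \times m$ matrix $X = [\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)}]$. The solution of this equation is
unique if and only if $X'X$ is nonsingular, in which case

$$\hat{\boldsymbol{\theta}} = (X'X)^{-1}X'\mathbf{y}.$$

If $X'X$ is singular, there are infinitely many solutions $\hat{\boldsymbol{\theta}}$, but the vector of fitted values
$X\hat{\boldsymbol{\theta}}$ is the same for all of them.


Example B.1.1

To illustrate the general case, let us fit a quadratic function

$$y = \theta_0 + \theta_1 x + \theta_2 x^2$$

to the data

      x    0    1    2    3    4
      y    1    0    3    5    8

The matrix $X$ for this problem is

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix}, \quad \text{giving} \quad (X'X)^{-1} = \frac{1}{140}\begin{bmatrix} 124 & -108 & 20 \\ -108 & 174 & -40 \\ 20 & -40 & 10 \end{bmatrix}.$$

The least squares estimate $\hat{\boldsymbol{\theta}} = (\hat\theta_0, \hat\theta_1, \hat\theta_2)'$ is therefore unique and given by

$$\hat{\boldsymbol{\theta}} = (X'X)^{-1}X'\mathbf{y} = \begin{bmatrix} 0.6 \\ -0.1 \\ 0.5 \end{bmatrix}.$$

The vector of fitted values is given by

$$\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}} = (0.6, 1, 2.4, 4.8, 8.2)'$$

as compared with the observed values

$$\mathbf{y} = (1, 0, 3, 5, 8)'.$$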
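The calculations of Example B.1.1 can be reproduced exactly with rational arithmetic. The sketch below solves the normal equations $X'X\hat{\boldsymbol{\theta}} = X'\mathbf{y}$ by Gauss-Jordan elimination instead of forming $(X'X)^{-1}$ explicitly; the helper `solve` is a hypothetical name and assumes that the coefficient matrix is nonsingular.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A t = b by Gauss-Jordan elimination in exact rational arithmetic.
    Raises StopIteration if A is singular (no nonzero pivot is found)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]          # scale the pivot row
        for r in range(n):
            if r != c:                               # clear column c elsewhere
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [row[-1] for row in M]

xs = [0, 1, 2, 3, 4]
ys = [1, 0, 3, 5, 8]
X = [[1, x, x * x] for x in xs]                      # columns 1, x, x^2
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(3)]
theta = solve(XtX, Xty)                              # (3/5, -1/10, 1/2)
fitted = [sum(t * v for t, v in zip(theta, row)) for row in X]
```

The exact solution $(3/5, -1/10, 1/2)$ agrees with the decimal values $(0.6, -0.1, 0.5)$ in the example, and `fitted` reproduces the vector $(0.6, 1, 2.4, 4.8, 8.2)'$.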


B.1.1 The Gauss-Markov Theorem 


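Suppose now that the observations $y_1, \ldots, y_n$ are realized values of random variables
$Y_1, \ldots, Y_n$ satisfying

$$Y_i = \theta_1 x_{i1} + \cdots + \theta_m x_{im} + Z_i,$$

where $Z_i \sim \mathrm{WN}(0, \sigma^2)$. Letting $\mathbf{Y} = (Y_1, \ldots, Y_n)'$ and $\mathbf{Z} = (Z_1, \ldots, Z_n)'$, we can
write these equations as

$$\mathbf{Y} = X\boldsymbol{\theta} + \mathbf{Z}.$$

Assume for simplicity that the matrix $X'X$ is nonsingular (for the general case see,
e.g., Silvey, 1975). Then the least squares estimator of $\boldsymbol{\theta}$ is, as above,

$$\hat{\boldsymbol{\theta}} = (X'X)^{-1}X'\mathbf{Y},$$

and the least squares estimator of the parameter $\sigma^2$ is the unbiased estimator

$$\hat\sigma^2 = \frac{1}{n - m}\big\|\mathbf{Y} - X\hat{\boldsymbol{\theta}}\big\|^2.$$

It is easy to see that $\hat{\boldsymbol{\theta}}$ is also unbiased, i.e., that

$$E(\hat{\boldsymbol{\theta}}) = \boldsymbol{\theta}.$$

It follows at once that if $\mathbf{c}'\boldsymbol{\theta}$ is any linear combination of the parameters $\theta_i$,
$i = 1, \ldots, m$, then $\mathbf{c}'\hat{\boldsymbol{\theta}}$ is an unbiased estimator of $\mathbf{c}'\boldsymbol{\theta}$. The Gauss-Markov theorem says
that of all unbiased estimators of $\mathbf{c}'\boldsymbol{\theta}$ of the form $\sum_{i=1}^{n} a_iY_i$, the estimator $\mathbf{c}'\hat{\boldsymbol{\theta}}$ has the
smallest variance.

In the special case where $Z_1, \ldots, Z_n$ are IID N$(0, \sigma^2)$, the least squares estimator
$\hat{\boldsymbol{\theta}}$ has the distribution $N(\boldsymbol{\theta}, \sigma^2(X'X)^{-1})$, and $(n - m)\hat\sigma^2/\sigma^2$ has the $\chi^2$ distribution
with $n - m$ degrees of freedom.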


B.1.2 Generalized Least Squares 


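The Gauss-Markov theorem depends on the assumption that the errors $Z_1, \ldots, Z_n$
are uncorrelated with constant variance. If, on the other hand, $\mathbf{Z} = (Z_1, \ldots, Z_n)'$
has mean $\mathbf{0}$ and nonsingular covariance matrix $\sigma^2\Sigma$, where $\Sigma \neq I$, we
consider the transformed observation vector $\mathbf{U} = R^{-1}\mathbf{Y}$, where $R$ is a
nonsingular matrix such that $RR' = \Sigma$. Then


$$\mathbf{U} = R^{-1}X\boldsymbol{\theta} + \mathbf{W} = M\boldsymbol{\theta} + \mathbf{W},$$


where $M = R^{-1}X$ and $\mathbf{W}$ has mean $\mathbf{0}$ and covariance matrix $\sigma^2 I$. The
Gauss-Markov theorem now implies that the best linear estimate of any linear combination
$\mathbf{c}'\boldsymbol{\theta}$ is $\mathbf{c}'\hat{\boldsymbol{\theta}}$, where
$\hat{\boldsymbol{\theta}}$ is the generalized least squares estimator, which minimizes


$$\|\mathbf{U} - M\boldsymbol{\theta}\|^2.$$

In the special case where $Z_1, \ldots, Z_n$ are uncorrelated and $Z_i$ has mean 0 and variance
$\sigma^2 r_i^2$, the generalized least squares estimator minimizes the weighted sum of squares


$$\sum_{i=1}^{n} \frac{1}{r_i^2}\,\bigl(Y_i - \theta_1 x_{i1} - \cdots - \theta_m x_{im}\bigr)^2.$$


In the general case, if $X'X$ and $\Sigma$ are both nonsingular, the generalized least squares
estimator is given by


$$\hat{\boldsymbol{\theta}} = (M'M)^{-1} M'\mathbf{U}.$$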


Although the least squares estimator $(X'X)^{-1}X'\mathbf{Y}$ is unbiased if $E(\mathbf{Z}) = \mathbf{0}$,
even when the covariance matrix of $\mathbf{Z}$ is not equal to $\sigma^2 I$, the variance of the
corresponding estimate of any linear combination of $\theta_1, \ldots, \theta_m$ is greater than or
equal to the variance of the estimate based on the generalized least squares estimator.
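The weighted special case can be sketched for a single regressor, where $R$ is the diagonal matrix with entries $r_1, \ldots, r_n$. The example below is a hypothetical illustration (the data, the multipliers $r_i$, and the function names are assumptions); it forms $\mathbf{U}$ and $M$ as above and also compares the two variances, each expressed as a multiple of $\sigma^2$.

```python
def gls_one_regressor(x, y, r):
    """GLS estimate of theta in Y_i = theta * x_i + Z_i with Var(Z_i) = sigma^2 * r_i^2."""
    # Transform as in the text: U = R^{-1} Y and M = R^{-1} X with R = diag(r_1, ..., r_n).
    u = [yi / ri for yi, ri in zip(y, r)]
    m = [xi / ri for xi, ri in zip(x, r)]
    # Ordinary least squares on the transformed model: (M'M)^{-1} M'U.
    return sum(mi * ui for mi, ui in zip(m, u)) / sum(mi * mi for mi in m)


def variance_factors(x, r):
    """Return Var(theta_hat) / sigma^2 for the ordinary and generalized estimators."""
    sxx = sum(xi * xi for xi in x)
    ols = sum((xi * ri) ** 2 for xi, ri in zip(x, r)) / sxx ** 2
    gls = 1.0 / sum((xi / ri) ** 2 for xi, ri in zip(x, r))
    return ols, gls


x = [1.0, 2.0, 3.0, 4.0]
r = [1.0, 1.0, 2.0, 4.0]          # hypothetical standard-deviation multipliers
y = [2.1, 3.9, 6.4, 7.0]          # hypothetical observations
theta_gls = gls_one_regressor(x, y, r)
var_ols, var_gls = variance_factors(x, r)
```

For these multipliers the ordinary least squares variance factor exceeds the generalized one, in line with the inequality stated above.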


B.2 Maximum Likelihood Estimation 


The method of least squares has an appealing intuitive interpretation. Its application 
depends on knowledge only of the means and covariances of the observations. Max- 
imum likelihood estimation depends on the assumption of a particular distributional 
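form for the observations, known apart from the values of parameters $\theta_1, \ldots, \theta_m$. We
can regard the estimation problem as that of selecting the most appropriate value of
a parameter vector $\boldsymbol{\theta}$, taking values in a subset $\Theta$ of $\mathbb{R}^m$. We
suppose that these distributions have probability densities $p(\mathbf{x}; \boldsymbol{\theta})$,
$\boldsymbol{\theta} \in \Theta$. For a fixed vector of observations $\mathbf{x}$, the function
$L(\boldsymbol{\theta}) = p(\mathbf{x}; \boldsymbol{\theta})$ on $\Theta$ is called the likelihood
function. A maximum likelihood estimate $\hat{\boldsymbol{\theta}}(\mathbf{x})$ of $\boldsymbol{\theta}$
is a value of $\boldsymbol{\theta} \in \Theta$ that maximizes the value of $L(\boldsymbol{\theta})$
for the given observed value $\mathbf{x}$, i.e.,

$$L(\hat{\boldsymbol{\theta}}) = p(\mathbf{x}; \hat{\boldsymbol{\theta}}(\mathbf{x})) = \max_{\boldsymbol{\theta} \in \Theta} p(\mathbf{x}; \boldsymbol{\theta}).$$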
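Example B.2.1 If $\mathbf{x} = (x_1, \ldots, x_n)'$ is a vector of observations of independent
N$(\mu, \sigma^2)$ random variables, the likelihood function is


$$L(\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Bigl[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\Bigr], \qquad -\infty < \mu < \infty,\ \sigma > 0.$$

Maximization of $L$ with respect to $\mu$ and $\sigma$ is equivalent to minimization of

$$-2 \ln L(\mu, \sigma^2) = n \ln(2\pi) + 2n \ln(\sigma) + \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.$$


Setting the partial derivatives of $-2 \ln L$ with respect to $\mu$ and $\sigma$ both equal to zero
gives the maximum likelihood estimates


$$\hat\mu = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad\text{and}\qquad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$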
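The closed-form estimates of Example B.2.1 can be checked numerically against the function $-2 \ln L$. The following is a minimal sketch with a hypothetical sample; the function names are ours.

```python
import math


def neg2_log_likelihood(xs, mu, sigma):
    """-2 ln L(mu, sigma^2) for independent N(mu, sigma^2) observations."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return n * math.log(2 * math.pi) + 2 * n * math.log(sigma) + ss / sigma ** 2


def normal_mle(xs):
    """Closed-form maximum likelihood estimates (mu_hat, sigma2_hat)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # divisor n, not n - 1
    return mu_hat, sigma2_hat


xs = [4.9, 5.6, 4.4, 5.1, 6.0]     # hypothetical sample
mu_hat, sigma2_hat = normal_mle(xs)
best = neg2_log_likelihood(xs, mu_hat, math.sqrt(sigma2_hat))
```

Evaluating `neg2_log_likelihood` at any other pair $(\mu, \sigma)$ should give a value no smaller than `best`.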


B.2.1 Properties of Maximum Likelihood Estimators 


The Gauss—Markov theorem lent support to the use of least squares estimation by 
showing its property of minimum variance among unbiased linear estimators. Maxi- 
mum likelihood estimators are not generally unbiased, but in particular cases they can 
be shown to have small mean squared error relative to other competing estimators. 
Their main justification, however, lies in their good large-sample behavior. 

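For independent and identically distributed observations with true probability
density $p(\cdot\,; \boldsymbol{\theta}_0)$ satisfying certain regularity conditions, it can be shown
that the maximum likelihood estimator $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}_0$ converges
in probability to $\boldsymbol{\theta}_0$ and that the distribution of
$\sqrt{n}\,(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ is approximately normal with mean
$\mathbf{0}$ and covariance matrix $I(\boldsymbol{\theta}_0)^{-1}$, where $I(\boldsymbol{\theta})$ is
Fisher's information matrix with $(i, j)$ component


$$E_{\boldsymbol{\theta}}\left[\frac{\partial \ln p(X; \boldsymbol{\theta})}{\partial \theta_i}\,\frac{\partial \ln p(X; \boldsymbol{\theta})}{\partial \theta_j}\right].$$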

In time series analysis the situation is rather more complicated than in the case 
of iid observations. “Likelihood” in the time series context is almost always used in 
the sense of Gaussian likelihood, i.e., the likelihood computed under the (possibly 
false) assumption that the series is Gaussian. Nevertheless, estimators of ARMA 
coefficients computed by maximization of the Gaussian likelihood have good
large-sample properties analogous to those described in the preceding paragraph. For
details see TSTM, Section 10.8.


B.3 Confidence Intervals


Estimation of a parameter or parameter vector by least squares or maximum likelihood
leads to a particular value, often referred to as a point estimate. It is clear that this
will rarely be exactly equal to the true value, and so it is important to convey some
idea of the probable accuracy of the estimator. This can be done using the notion of a
confidence interval, which specifies a random set covering the true parameter value
with some specified (high) probability.


Example B.3.1 If $\mathbf{X} = (X_1, \ldots, X_n)'$ is a vector of independent N$(\mu, \sigma^2)$
random variables, we saw in Section B.2 that the random variable
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ is the maximum likelihood estimator of $\mu$. This is a
point estimator of $\mu$. To construct a confidence interval for $\mu$ from $\bar{X}_n$, we observe
that the random variable


$$\frac{\bar{X}_n - \mu}{S/\sqrt{n}}$$

has Student's $t$-distribution with $n - 1$ degrees of freedom, where $S$ is the sample
standard deviation, i.e., $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$. Hence,


$$P\left(-t_{1-\alpha/2} < \frac{\bar{X}_n - \mu}{S/\sqrt{n}} < t_{1-\alpha/2}\right) = 1 - \alpha,$$


where $t_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the $t$-distribution with $n - 1$ degrees
of freedom. This probability statement can be expressed in the form


$$P\left(\bar{X}_n - t_{1-\alpha/2}\, S/\sqrt{n} < \mu < \bar{X}_n + t_{1-\alpha/2}\, S/\sqrt{n}\right) = 1 - \alpha,$$


which shows that the random interval bounded by $\bar{X}_n \pm t_{1-\alpha/2}\, S/\sqrt{n}$ includes the
true value $\mu$ with probability $1 - \alpha$. This interval is called a $(1 - \alpha)$ confidence
interval for the mean $\mu$.
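A minimal sketch of the interval in Example B.3.1 follows. The sample is hypothetical, and the quantile $t_{1-\alpha/2}$ is supplied by the caller from a table, since the Python standard library has no $t$-distribution; the value 2.776 used here is the tabulated 0.975 quantile for 4 degrees of freedom.

```python
import math


def t_confidence_interval(xs, t_quantile):
    """Interval mean +/- t_quantile * S / sqrt(n), with S^2 computed using divisor n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    half_width = t_quantile * s / math.sqrt(n)
    return mean - half_width, mean + half_width


xs = [4.9, 5.6, 4.4, 5.1, 6.0]      # hypothetical sample, n = 5
T_975_4DF = 2.776                   # tabulated 0.975 quantile of t with 4 degrees of freedom
lower, upper = t_confidence_interval(xs, T_975_4DF)
```

The resulting pair `(lower, upper)` is a 95% confidence interval for $\mu$ under the normality assumption of the example.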


B.3.1 Large-Sample Confidence Regions 


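Many estimators of a vector-valued parameter $\boldsymbol{\theta}$ are approximately normally
distributed when the sample size $n$ is large. For example, under mild regularity conditions,
the maximum likelihood estimator $\hat{\boldsymbol{\theta}}(\mathbf{X})$ of
$\boldsymbol{\theta} = (\theta_1, \ldots, \theta_m)'$ is approximately
N$(\boldsymbol{\theta}, n^{-1} I(\boldsymbol{\theta})^{-1})$, where $I(\boldsymbol{\theta})$ is the
Fisher information defined in Section B.2. Consequently,

$$n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})'\, I(\hat{\boldsymbol{\theta}})\, (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})$$

is approximately distributed as $\chi^2$ with $m$ degrees of freedom, and the random set of
$\boldsymbol{\theta}$-values defined by

$$n(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})'\, I(\hat{\boldsymbol{\theta}})\, (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \le \chi^2_{1-\alpha}(m)$$


covers the true value of $\boldsymbol{\theta}$ with probability approximately equal to $1 - \alpha$.

Example B.3.2 For iid observations $X_1, \ldots, X_n$ from N$(\mu, \sigma^2)$, a straightforward
calculation gives, for $\boldsymbol{\theta} = (\mu, \sigma^2)'$,


$$I(\boldsymbol{\theta}) = \begin{bmatrix} \sigma^{-2} & 0 \\ 0 & \sigma^{-4}/2 \end{bmatrix}.$$

Thus we obtain the large-sample confidence region for $(\mu, \sigma^2)'$,

$$n(\mu - \bar{X}_n)^2/\hat\sigma^2 + n(\sigma^2 - \hat\sigma^2)^2/(2\hat\sigma^4) \le \chi^2_{1-\alpha}(2),$$


which covers the true value of $\boldsymbol{\theta}$ with probability approximately equal to
$1 - \alpha$. This region is an ellipse centered at $(\bar{X}_n, \hat\sigma^2)$.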
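Membership in the elliptical region of Example B.3.2 is easy to check numerically. The sketch below uses a hypothetical helper name; it relies on the fact that the $\chi^2$ distribution with 2 degrees of freedom is exponential with mean 2, so that $\chi^2_{1-\alpha}(2) = -2\ln\alpha$.

```python
import math


def in_confidence_region(xs, mu, sigma2, alpha=0.05):
    """True if (mu, sigma2) lies in the large-sample (1 - alpha) region of Example B.3.2."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n          # maximum likelihood estimate of sigma^2
    stat = n * (mu - mean) ** 2 / s2 + n * (sigma2 - s2) ** 2 / (2 * s2 ** 2)
    # Chi-squared with 2 degrees of freedom is exponential with mean 2,
    # so its (1 - alpha) quantile is -2 ln(alpha).
    return stat <= -2.0 * math.log(alpha)


xs = [4.9, 5.6, 4.4, 5.1, 6.0]                       # hypothetical sample
centre_inside = in_confidence_region(xs, 5.2, 0.308)     # the centre of the ellipse
far_point_inside = in_confidence_region(xs, 15.2, 0.308)
```

Because the region rests on a large-sample approximation, a sample as small as this one serves only to show the mechanics.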


B.4 Hypothesis Testing 


Parameter estimation can be regarded as choosing one from infinitely many possible
decisions regarding the value of a parameter vector $\boldsymbol{\theta}$. Hypothesis testing, on the
other hand, involves a choice between two alternative hypotheses, a "null" hypothesis $H_0$
and an "alternative" hypothesis $H_1$, regarding the parameter vector $\boldsymbol{\theta}$. The
hypotheses $H_0$ and $H_1$ correspond to subsets $\Theta_0$ and $\Theta_1$ of the parameter set
$\Theta$. The problem is to decide, on the basis of an observed data vector $\mathbf{X}$, whether or
not we should reject the null hypothesis $H_0$. A statistical test of $H_0$ can therefore be regarded
as a partition of the sample space into one set of values of $\mathbf{X}$ for which we reject $H_0$
and another for which we do not. The problem is to specify a test (i.e., a subset of the
sample space called the "rejection region") for which the corresponding decision rule
performs well in practice.


Example B.4.1 If $\mathbf{X} = (X_1, \ldots, X_n)'$ is a vector of independent N$(\mu, 1)$ random
variables, we may wish to test the null hypothesis $H_0$: $\mu = 0$ against the alternative
$H_1$: $\mu \neq 0$. A plausible choice of rejection region in this case is the set of all samples
$\mathbf{X}$ for which $|\bar{X}_n| > c$ for some suitably chosen constant $c$. We shall return to
this example after considering those factors that should be taken into account in the systematic
selection of a "good" rejection region.


B.4.1 Error Probabilities 


There are two types of error that may be incurred in the application of a statistical
test:


• A type I error is the rejection of $H_0$ when it is true.


• A type II error is the acceptance of $H_0$ when it is false.

For a given test (i.e., for a given rejection region $R$), the probabilities of error can
both be found from the power function of the test, defined as


$$P_{\boldsymbol{\theta}}(R), \qquad \boldsymbol{\theta} \in \Theta,$$


where $P_{\boldsymbol{\theta}}$ is the distribution of $\mathbf{X}$ when the true parameter value is
$\boldsymbol{\theta}$. The probabilities of a type I error are


$$\alpha(\boldsymbol{\theta}) = P_{\boldsymbol{\theta}}(R), \qquad \boldsymbol{\theta} \in \Theta_0,$$

and the probabilities of a type II error are

$$\beta(\boldsymbol{\theta}) = 1 - P_{\boldsymbol{\theta}}(R), \qquad \boldsymbol{\theta} \in \Theta_1.$$


It is not generally possible to find a test that simultaneously minimizes
$\alpha(\boldsymbol{\theta})$ and $\beta(\boldsymbol{\theta})$ for all values of their arguments.
Instead, therefore, we seek to limit the probability of type I error and then, subject to this
constraint, to minimize the probability of type II error uniformly on $\Theta_1$. Given a
significance level $\alpha$, an optimum level-$\alpha$ test is a test satisfying


$$\alpha(\boldsymbol{\theta}) \le \alpha \qquad \text{for all } \boldsymbol{\theta} \in \Theta_0,$$


that minimizes $\beta(\boldsymbol{\theta})$ for every $\boldsymbol{\theta} \in \Theta_1$. Such a test
is called a uniformly most powerful (U.M.P.) test of level $\alpha$. The quantity
$\sup_{\boldsymbol{\theta} \in \Theta_0} \alpha(\boldsymbol{\theta})$ is called the size of the
test.
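For the rejection region $|\bar{X}_n| > c$ of Example B.4.1 the power function has a closed form, since $\bar{X}_n \sim$ N$(\mu, 1/n)$. The sketch below evaluates it with the standard library class `statistics.NormalDist`; the sample size, the level, and the function name are hypothetical choices, and $c$ is chosen here so that the size of the test equals $\alpha$.

```python
import math
from statistics import NormalDist


def power(mu, n, c):
    """P_mu(|sample mean| > c) when X_1, ..., X_n are independent N(mu, 1)."""
    z = NormalDist()
    root_n = math.sqrt(n)
    # The sample mean is N(mu, 1/n), so standardize both tail events.
    return 1.0 - z.cdf(root_n * (c - mu)) + z.cdf(root_n * (-c - mu))


n, alpha = 25, 0.05
c = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n)   # gives size alpha at mu = 0
size = power(0.0, n, c)               # type I error probability alpha(0)
type_two = 1.0 - power(0.5, n, c)     # type II error probability beta(0.5)
```

Evaluating `power` over a grid of values of $\mu$ traces out the whole power function, from which both error probabilities can be read off.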

In the special case of a simple hypothesis vs. a simple hypothesis, e.g.,
$H_0$: $\boldsymbol{\theta} = \boldsymbol{\theta}_0$ vs. $H_1$: $\boldsymbol{\theta} = \boldsymbol{\theta}_1$,
an optimal test based on the likelihood ratio statistic can be constructed
(see Silvey, 1975). Unfortunately, it is usually not possible to find a uniformly most
powerful test of a simple hypothesis against a composite (more than one value of
$\boldsymbol{\theta}$) alternative. This problem can sometimes be solved by searching for uniformly most
powerful tests within the smaller classes of unbiased or invariant tests. For further 
information see Lehmann (1986). 


B.4.2 Large-Sample Tests Based on Confidence Regions 


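There is a natural link between the testing of a simple hypothesis
$H_0$: $\boldsymbol{\theta} = \boldsymbol{\theta}_0$ vs. $H_1$: $\boldsymbol{\theta} \neq \boldsymbol{\theta}_0$
and the construction of confidence regions. To illustrate this connection, suppose that
$\hat{\boldsymbol{\theta}}$ is an estimator of $\boldsymbol{\theta}$ whose distribution is approximately
N$(\boldsymbol{\theta}, n^{-1} I^{-1}(\boldsymbol{\theta}))$, where $I(\boldsymbol{\theta})$ is a
positive definite matrix. This is usually the case, for example, when $\hat{\boldsymbol{\theta}}$
is a maximum likelihood estimator and $I(\boldsymbol{\theta})$ is the Fisher information. As in Section
B.3.1, we have

$$P_{\boldsymbol{\theta}}\bigl(n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})'\, I(\hat{\boldsymbol{\theta}})\, (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \le \chi^2_{1-\alpha}(m)\bigr) \approx 1 - \alpha.$$

Consequently, an approximate $\alpha$-level test is to reject $H_0$ if

$$n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)'\, I(\hat{\boldsymbol{\theta}})\, (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) > \chi^2_{1-\alpha}(m),$$

or equivalently, if the confidence region determined by those $\boldsymbol{\theta}$'s satisfying

$$n(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})'\, I(\hat{\boldsymbol{\theta}})\, (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \le \chi^2_{1-\alpha}(m)$$

does not include $\boldsymbol{\theta}_0$.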


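Example B.4.2 Consider again the problem described in Example B.4.1. Since
$\bar{X}_n \sim$ N$(\mu, n^{-1})$, the hypothesis $H_0$: $\mu = 0$ is rejected at level $\alpha$ if


$$n(\bar{X}_n)^2 > \chi^2_{1-\alpha}(1),$$

or equivalently, if


$$|\bar{X}_n| > \frac{\Phi_{1-\alpha/2}}{n^{1/2}},$$

where $\Phi_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the standard normal distribution.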

Mean Square Convergence 


C.1 The Cauchy Criterion 


The sequence $S_n$ of random variables is said to converge in mean square to the random
variable $S$ if


$$E(S_n - S)^2 \to 0 \quad \text{as } n \to \infty.$$


In particular, we say that the sum $\sum_{k=1}^{n} X_k$ converges (in mean square) if there exists
a random variable $S$ such that $E\bigl(\sum_{k=1}^{n} X_k - S\bigr)^2 \to 0$ as $n \to \infty$. If
this is the case, then we use the notation $S = \sum_{k=1}^{\infty} X_k$.


C.1 The Cauchy Criterion 


For a given sequence $S_n$ of random variables to converge in mean square to some
random variable, it is necessary and sufficient that

$$E(S_m - S_n)^2 \to 0 \quad \text{as } m, n \to \infty$$


(for a proof of this see TSTM, Chapter 2). The point of the criterion is that it permits
checking for mean square convergence without having to identify the limit of the
sequence.


Example C.1.1 Consider the sequence of partial sums $S_n = \sum_{i=-n}^{n} a_i Z_i$,
$n = 1, 2, \ldots$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Under what conditions on the
coefficients $a_i$ does this sequence converge in mean square? To answer this question we apply the
Cauchy criterion as follows. For $n > m > 0$,


$$E(S_n - S_m)^2 = E\Bigl(\sum_{m < |i| \le n} a_i Z_i\Bigr)^2 = \sigma^2 \sum_{m < |i| \le n} a_i^2.$$

Consequently, $E(S_n - S_m)^2 \to 0$ if and only if $\sum_{m < |i| \le n} a_i^2 \to 0$. Since the
Cauchy criterion applies also to real-valued sequences, this last condition is equivalent to
convergence of the sequence $\sum_{i=-n}^{n} a_i^2$, or equivalently to the condition


$$\sum_{i=-\infty}^{\infty} a_i^2 < \infty.$$
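The quantity $E(S_n - S_m)^2 = \sigma^2 \sum_{m < |i| \le n} a_i^2$ from Example C.1.1 can be evaluated numerically for particular coefficient sequences. The two sequences below are hypothetical illustrations: one is square-summable, the other is not.

```python
def cauchy_gap(a, m, n, sigma2=1.0):
    """E(S_n - S_m)^2 = sigma^2 * sum of a(i)^2 over m < |i| <= n, for n > m >= 0."""
    return sigma2 * sum(a(i) ** 2 + a(-i) ** 2 for i in range(m + 1, n + 1))


def square_summable(i):
    return 1.0 / (1 + abs(i))            # the sum of squares converges


def not_square_summable(i):
    return 1.0 / (1 + abs(i)) ** 0.5     # the sum of squares diverges (harmonic series)


gap_small = cauchy_gap(square_summable, 1000, 2000)
gap_large = cauchy_gap(not_square_summable, 1000, 2000)
```

For the first sequence the gap shrinks toward zero as $m$ and $n$ grow, while for the second it stays bounded away from zero whenever $n = 2m$, so only the first series converges in mean square.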


Properties of Mean Square Convergence:

If $X_n \to X$ and $Y_n \to Y$ in mean square as $n \to \infty$, then

(a) $E(X_n^2) \to E(X^2)$,

(b) $E(X_n) \to E(X)$,

and

(c) $E(X_n Y_n) \to E(XY)$.


Proof. See TSTM, Proposition 2.1.2. ■


An ITSM Tutorial 


D.1 Getting Started 

D.2 Preparing Your Data for Modeling 
D.3 Finding a Model for Your Data 
D.4 Testing Your Model 

D.5 Prediction 

D.6 Model Properties 

D.7 Multivariate Time Series 


The package ITSM2000, the student version of which is included with this book, 
requires an IBM-compatible PC operating under Windows 95, NT version 4.0, or a
later version of either of these operating systems. To install the package, copy the 
folder ITSM2000 from the CD-ROM to any convenient location on your hard disk. 
To run the program, you can either double-click on the icon ITSM.EXE in the folder 
ITSM2000 or, on the Windows task bar, left-click on Start, select Run, enter the 
location and name of the file ITSM.EXE (e.g. C:\ITSM2000\ITSM.EXE) and click 
on OK. You may find it convenient to create a shortcut on your desktop by right- 
clicking on the ITSM.EXE icon and selecting Create shortcut. Then right-click 
on the shortcut icon, drag it to your desktop, and select Move here. The program 
can then be run at any time by double-clicking on the shortcut icon. The program 
can also be run directly from the CD-ROM by opening the folder ITSM2000 and 
double-clicking on the icon ITSM.EXE. The package ITSM2000 supersedes earlier 
versions of the package ITSM distributed with this book.


D.1 Getting Started


D.1.1 Running ITSM 


Double-click on the icon labeled ITSM.EXE, and the ITSM window will open. Se- 
lecting the option Help>Contents will show you the topics for which explanations 
and examples are provided. Clicking on Index at the top of the Help window will 
allow you to find more specific topics. Close the Help window by clicking on the 
X at its top right corner. To begin analyzing one of the data sets provided, select 
File>Project>Open at the top left corner of the ITSM window. 

There are several distinct functions of the program ITSM. The first is to analyze 
and display the properties of time series data, the second is to compute and display the 
properties of time series models, and the third is to combine these functions in order 
to fit models to data. The last of these includes checking that the properties of the 
fitted model match those of the data in a suitable sense. Having found an appropriate 
model, we can (for example) then use it in conjunction with the data to forecast future 
values of the series. Sections D.2—D.5 of this appendix deal with the modeling and 
analysis of data, while Section D.6 is concerned with model properties. Section D.7 
explains how to open multivariate projects in ITSM. Examples of the analysis of 
multivariate time series are given in Chapter 7. 

It is important to keep in mind the distinction between data and model properties 
and not to confuse the data with the model. In any one project ITSM stores one data 
set and one model (which can be identified by highlighting the project window and 
pressing the red INFO button at the top of the ITSM window). Until a model is entered 
by the user, ITSM stores the default model of white noise with variance 1. If the data 
are transformed (e.g., differenced and mean-corrected), then the data are replaced 
in ITSM by the transformed data. (The original data can, however, be restored by 
inverting the transformations.) Rarely (if ever) is a real time series generated by a 
model as simple as those used for fitting purposes. In model fitting the objective is to 
develop a model that mimics important features of the data, but is still simple enough 
to be used with relative ease. 

The following sections constitute a tutorial that illustrates the use of some of 
the features of ITSM by leading you through a complete analysis of the well-known 
airline passenger series of Box and Jenkins (1976) filed as AIRPASS.TSM in the 
ITSM2000 folder. 


D.2 Preparing Your Data for Modeling 


The observed values of your time series should be available in a single-column ASCII
file (or two columns for a bivariate series). The file, like those provided with the
package, should be given a name with suffix .TSM. You can then begin model fitting with
ITSM. The program will read your data from the file, plot it on the screen, compute
sample statistics, and allow you to make a number of transformations designed to
make your transformed data representable as a realization of a zero-mean stationary
process.


Example D.2.1 To illustrate the analysis we shall use the file AIRPASS.TSM, which contains the
number of international airline passengers (in thousands) for each month from
January, 1949, through December, 1960.


D.2.1 Entering Data 


Once you have opened the ITSM window as described above under Getting Started, 
select the options File>Project>Open, and you will see a dialog box in which you 
can check either Univariate or Multivariate. Since the data set for this example 
is univariate, make sure that the univariate option is checked and then click OK. 
A window labeled Open File will then appear, in which you can either type the 
name AIRPASS.TSM and click Open, or else locate the icon for AIRPASS.TSM 
in the Open File window and double-click on it. You will then see a graph of the 
monthly international airline passenger totals (measured in thousands) $X_1, \ldots, X_n$,
with $n = 144$. Directly behind the graph is a window containing data summary
statistics.

An additional, second, project can be opened by repeating the procedure described 
in the preceding paragraph. Alternatively, the data can be replaced in the current 
project using the option File>Import File. This option is useful if you wish to 
examine how well a fitted model represents a different data set. (See the entry Project 
Editor in the ITSM Help Files for information on multiple project management. Each 
ITSM project has its own data set and model.) For the purpose of this introduction 
we shall open only one project. 


D.2.2 Information 


If, with the window labeled AIRPASS.TSM highlighted, you press the red INFO 
button at the top of the ITSM window, you will see the sample mean, sample variance, 
estimated standard deviation of the sample mean, and the current model (white noise 
with variance 1). 


Example D.2.2 Go through the steps in Entering Data to open the project AIRPASS.TSM and use
the INFO button to determine the sample mean and variance of the series.


D.2.3 Filing Data 


You may wish to transform your data using ITSM and then store it in another file. At
any time before or after transforming the data in ITSM, the data can be exported to a
file by clicking on the red Export button, selecting Time Series and File, clicking
OK, and specifying a new file name. The numerical values of the series can also be
pasted to the clipboard (and from there into another document) in the same way by
choosing Clipboard instead of File. Other quantities computed by the program
(e.g., the residuals from the current model) can be filed or pasted to the clipboard in
the same way by making the appropriate selection in the Export dialog box. Graphs
can also be pasted to the clipboard by right-clicking on them and selecting Copy to
Clipboard.


Copy the series AIRPASS.TSM to the clipboard, open Wordpad or some convenient 
screen editor, and choose Edit>Paste to insert the series into your new document. 
Then copy the graph of the series to the clipboard and insert it into your document 
in the same way. 


D.2.4 Plotting Data 


A time series graph is automatically plotted when you open a data file (with time 
measured in units of the interval between observations, i.e., t = 1,2, 3,...). To see 
a histogram of the data press the rightmost yellow button at the top of the ITSM 
screen. If you wish to adjust the number of bins in the histogram, select Statis- 
tics>Histogram>Set Bin Count and specify the number of bins required. The 
histogram will then be replotted accordingly. 

To insert any of the ITSM graphs into a text document, right-click on the graph 
concerned, select Copy to Clipboard, and the graph will be copied to the clipboard. 
It can then be pasted into a document opened by any standard text editor such as MS- 
Word or Wordpad using the Edit>Paste option in the screen editor. The graph can 
also be sent directly to a printer by right-clicking on the graph and selecting Print. 
Another useful graphics feature is provided by the white Zoom buttons at the top of 
the ITSM screen. The first and second of these enable you to enlarge a designated 
segment or box, respectively, of any of the graphs. The third button restores the 
original graph. 


Example D.2.4 Continuing with our analysis of AIRPASS.TSM, press the yellow histogram
button to see a histogram of the data. Replot the histogram with 20 bins by selecting
Statistics>Histogram>Set Bin Count.


D.2.5 Transforming Data 


Transformations are applied in order to produce data that can be successfully modeled 
as “stationary time series.” In particular, they are used to eliminate trend and cyclic 
components and to achieve approximate constancy of level and variability with time. 


Example D.2.5 The airline passenger data (see Figure 9.4) are clearly not stationary. The level
and variability both increase with time, and there appears to be a large seasonal component
(with period 12). They must therefore be transformed in order to be represented as
a realization of a stationary time series using one or more of the transformations
available for this purpose in ITSM.


Box-Cox Transformations


Box-Cox transformations are performed by selecting Transform>Box-Cox and
specifying the value of the Box-Cox parameter $\lambda$. If the original observations are
$Y_1, Y_2, \ldots, Y_n$, the Box-Cox transformation $f_\lambda$ converts them to
$f_\lambda(Y_1), f_\lambda(Y_2), \ldots, f_\lambda(Y_n)$, where


$$f_\lambda(y) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \neq 0, \\[1ex] \log(y), & \lambda = 0. \end{cases}$$


These transformations are useful when the variability of the data increases or
decreases with the level. By suitable choice of $\lambda$, the variability can often be made
nearly constant. In particular, for positive data whose standard deviation increases
linearly with level, the variability can be stabilized by choosing $\lambda = 0$.

The choice of $\lambda$ can be made visually by watching the graph of the data when
you click on the pointer in the Box-Cox dialog box and drag it back and forth along
the scale, which runs from zero to 1.5. Very often it is found that no transformation
is needed or that the choice $\lambda = 0$ is satisfactory.
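Outside ITSM, the transformation $f_\lambda$ is a one-line computation. The following minimal sketch (the function name is ours) applies it to a list of strictly positive observations.

```python
import math


def box_cox(ys, lam):
    """Apply the Box-Cox transformation f_lambda to each (strictly positive) observation."""
    if lam == 0:
        return [math.log(y) for y in ys]          # the limiting case lambda = 0
    return [(y ** lam - 1.0) / lam for y in ys]


logged = box_cox([112.0, 118.0, 132.0], 0)        # hypothetical values, natural logarithms
square_root_scale = box_cox([1.0, 4.0], 0.5)
```

Choosing `lam = 1` merely subtracts 1 from every observation, so it leaves the shape of the series unchanged.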


Example D.2.6 For the series AIRPASS.TSM, the variability increases with level, and the data are
strictly positive. Taking natural logarithms (i.e., choosing a Box-Cox transformation
with $\lambda = 0$) gives the transformed data shown in Figure D.1.

Notice how the amplitude of the fluctuations no longer increases with the level of 
the data. However, the seasonal effect remains, as does the upward trend. These will 
be removed shortly. The data stored in ITSM now consist of the natural logarithms 
of the original data. 


Classical Decomposition


There are two methods provided in ITSM for the elimination of trend and seasonality. 
These are: 


i. “classical decomposition” of the series into a trend component, a seasonal com- 
ponent, and a random residual component, and 
ii. differencing.

Figure D.1. The series AIRPASS.TSM after taking logs.

Classical decomposition of the series $\{X_t\}$ is based on the model


$$X_t = m_t + s_t + Y_t,$$


where $X_t$ is the observation at time $t$, $m_t$ is a "trend component," $s_t$ is a "seasonal
component," and $Y_t$ is a "random noise component," which is stationary with mean
zero. The objective is to estimate the components $m_t$ and $s_t$ and subtract them from
the data to generate a sequence of residuals (or estimated noise) that can then be
modeled as a stationary time series.

To achieve this, select Transform>Classical and you will see the Classical 
Decomposition dialog box. To remove a seasonal component and trend, check the 
Seasonal Fit and Polynomial Fit boxes, enter the period of the seasonal component,
and choose between the alternatives Quadratic Trend and Linear Trend.
Click OK, and the trend and seasonal components will be estimated and removed from 
the data, leaving the estimated noise sequence stored as the current data set. 

The estimated noise sequence automatically replaces the previous data stored in 
ITSM. 


The logged airline passenger data have an apparent seasonal component of period 
12 (corresponding to the month of the year) and an approximately quadratic trend. 
Remove these using the option Transform>Classical as described above. (An 
alternative approach is to use the option Regression, which allows the specification 
and fitting of polynomials of degree up to 10 and a linear combination of up to 4 sine 
waves.) 

Figure D.2 shows the transformed data (or residuals) $Y_t$, obtained by removal
of trend and seasonality from the logged AIRPASS.TSM series by classical
decomposition. $\{Y_t\}$ shows no obvious deviations from stationarity, and it would now be
reasonable to attempt to fit a stationary time series model to this series. To see how
well the estimated seasonal and trend components fit the data, select Transform>Show
Classical Fit. We shall not pursue this approach any further here, but turn instead
to the differencing approach. (You should have no difficulty in later returning to this
point and completing the classical decomposition analysis by fitting a stationary time
series model to $\{Y_t\}$.)

Figure D.2. The logged AIRPASS.TSM series after removal of trend and seasonal components
by classical decomposition.


Differencing 


Differencing is a technique that can also be used to remove seasonal components and 
trends. The idea is simply to consider the differences between pairs of observations 
with appropriate time separations. For example, to remove a seasonal component of
period 12 from the series $\{X_t\}$, we generate the transformed series


$$Y_t = X_t - X_{t-12}.$$


It is clear that all seasonal components of period 12 are eliminated by this transfor- 
mation, which is called differencing at lag 12. A linear trend can be eliminated by 
differencing at lag 1, and a quadratic trend by differencing twice at lag 1 (i.e., differ- 
encing once to get a new series, then differencing the new series to get a second new 
series). Higher-order polynomials can be eliminated analogously. It is worth noting 
that differencing at lag 12 eliminates not only seasonal components with period 12 
but also any linear trend. 
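The effect of differencing is easy to reproduce outside ITSM. The sketch below is an illustration with a hypothetical artificial series, not the airline data: it differences at lag 12 and then at lag 1.

```python
def difference(xs, lag):
    """Return the series X_t - X_{t-lag}; the result is shorter by `lag` values."""
    return [xs[t] - xs[t - lag] for t in range(lag, len(xs))]


# Hypothetical series: a linear trend plus a seasonal pattern of period 12.
season = [3, 1, 0, -2, -1, 2, 4, 5, 1, -3, -5, -5]
xs = [10 + 0.5 * t + season[t % 12] for t in range(48)]
d12 = difference(xs, 12)           # seasonality removed; the trend becomes the constant 12 * 0.5
d12_1 = difference(d12, 1)         # the remaining constant level is removed
```

After the lag-12 step the linear trend survives only as a constant level (12 times the slope), and the further lag-1 step reduces that level to zero.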

Data are differenced in ITSM by selecting Transform>Difference and entering
the required lag in the resulting dialog box.


Restore the original airline passenger data using the option File>Import File and
selecting AIRPASS.TSM. We take natural logarithms as in Example D.2.6 by
selecting Transform>Box-Cox and setting $\lambda = 0$. The transformed series can now be
deseasonalized by differencing at lag 12. To do this select Transform>Difference,
enter the lag 12 in the dialog box, and click OK. Inspection of the graph of the
deseasonalized series suggests a further differencing at lag 1 to eliminate the remaining
trend. To do this, repeat the previous step with lag equal to 1 and you will see the
transformed and twice-differenced series shown in Figure D.3.

Figure D.3. The series AIRPASS.TSM after taking logs and differencing at lags 12 and 1.


Subtracting the Mean 


The term ARMA model is used in ITSM to denote a zero-mean ARMA process 
(see Definition 3.1.1). To fit such a model to data, the sample mean of the data 
should therefore be small. Once the apparent deviations from stationarity of the data 
have been removed, we therefore (in most cases) subtract the sample mean of the 
transformed data from each observation to generate a series to which we then fit a 
zero-mean stationary model. Effectively we are estimating the mean of the model by 
the sample mean, then fitting a (zero-mean) ARMA model to the “mean-corrected” 
transformed data. If we know a priori that the observations are from a process with 
zero mean, then this process of mean correction is omitted. ITSM keeps track of all 
the transformations (including mean correction) that are made. When it comes time to 
predict the original series, ITSM will invert all these transformations automatically. 


Subtract the mean of the transformed and twice-differenced series AIRPASS.TSM by
selecting Transform>Subtract Mean. To check the current model status press the
red INFO button, and you will see that the current model is white noise with variance
1, since no model has yet been entered.


D.3 Finding a Model for Your Data 


After transforming the data (if necessary) as described above, we are now in a position 
to fit an ARMA model. ITSM uses a variety of tools to guide us in the search for
an appropriate model. These include the sample ACF (autocorrelation function), 
the sample PACF (partial autocorrelation function), and the AICC statistic, a bias- 
corrected form of Akaike’s AIC statistic (see Section 5.5.2). 


D.3.1 Autofit 


Before discussing the considerations that go into the selection, fitting, and checking 
of a stationary time series model, we first briefly describe an automatic feature of 
ITSM that searches through ARMA(p, q) models with p and q between specified 
limits (less than or equal to 27) and returns the model with smallest AICC value 
(see Sections 5.5.2 and D.3.5). Once the data set is judged to be representable by a 
stationary model, select Model>Estimation>Autofit. A dialog box will appear in 
which you must specify the upper and lower limits for p and q. Since the number of 
maximum likelihood models to be fitted is the product of the number of p-values and 
the number of q-values, these ranges should not be chosen to be larger than necessary. 
Once the limits have been specified, press Start, and the search will begin. You can 
watch the progress of the search in the dialog box that continually updates the values 
of p and q and the best model found so far. This option does not consider models 
in which the coefficients are required to satisfy constraints (other than causality) and 
consequently does not always lead to the optimal representation of the data. However, 
like the tools described below, it provides valuable information on which to base the 
selection of an appropriate model. 


D.3.2 The Sample ACF and PACF 


Pressing the second yellow button at the top of the ITSM window will produce graphs 
of the sample ACF and PACF for values of the lag h from 1 up to 40. For higher lags 
choose Statistics>ACF/PACF>Specify Lag, enter the maximum lag required, and 
click OK. Pressing the second yellow button repeatedly then rotates the display through 
ACF, PACF, and side-by-side graphs of both. Values of the ACF that decay rapidly as 
h increases indicate short-term dependency in the time series, while slowly decaying 
values indicate long-term dependency. For ARMA fitting it is desirable to have a 
sample ACF that decays fairly rapidly. A sample ACF that is positive and very slowly 
decaying suggests that the data may have a trend. A sample ACF with very slowly
damped periodicity suggests the presence of a periodic seasonal component. In either
of these two cases you may need to transform your data before continuing.

As a rule of thumb, the sample ACF and PACF are good estimates of the ACF
and PACF of a stationary process for lags up to about a third of the sample size. It is
clear from the definition of the sample ACF, $\hat\rho(h)$, that it will be a very poor estimator
of $\rho(h)$ for $h$ close to the sample size $n$.

The horizontal lines on the graphs of the sample ACF and PACF are the bounds 
$\pm 1.96/\sqrt{n}$. If the data constitute a large sample from an independent white noise 
sequence, approximately 95% of the sample autocorrelations should lie between 
these bounds. Large or frequent excursions from the bounds suggest that we need a 
model to explain the dependence, and they sometimes also suggest the kind of model we need 
(see below). To obtain numerical values of the sample ACF and PACF, right-click on 
the graphs and select Info. 
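
The sample ACF and the white noise bounds can also be computed directly from their definitions. The following is a minimal Python sketch, not ITSM's own code; the function names `sample_acf` and `white_noise_bound` are illustrative, and the divisor n (rather than n - h) is assumed in the sample autocovariances.

```python
import math

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat(0), ..., rho_hat(max_lag), using divisor n."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    gamma0 = sum(v * v for v in d) / n
    acf = []
    for h in range(max_lag + 1):
        gamma_h = sum(d[t] * d[t + h] for t in range(n - h)) / n
        acf.append(gamma_h / gamma0)
    return acf

def white_noise_bound(n):
    """Half-width of the approximate 95% bounds for white noise."""
    return 1.96 / math.sqrt(n)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
rho = sample_acf(x, 2)   # approximately [1.0, 0.4, -0.1]
```

Because only n - h products enter the sum at lag h, very few terms remain when h is close to n, which is why the estimates at high lags are unreliable.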

The graphs of the sample ACF and PACF sometimes suggest an appropriate 
ARMA model for the data. As a rough guide, if the sample ACF falls between the 
plotted bounds $\pm 1.96/\sqrt{n}$ for lags $h > q$, then an MA(q) model is suggested, while 
if the sample PACF falls between the plotted bounds $\pm 1.96/\sqrt{n}$ for lags $h > p$, then 
an AR(p) model is suggested. 
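
The rough guide above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure ITSM uses: the partial autocorrelations are obtained from the autocorrelations by the Durbin-Levinson recursion, and the hypothetical helper `suggested_order` simply returns the largest lag at which a value falls outside the bounds.

```python
import math

def pacf_from_acf(rho):
    """Partial autocorrelations at lags 1..m from rho(0..m) (Durbin-Levinson)."""
    m = len(rho) - 1
    pacf = []
    phi_prev = []          # phi_{k-1,1}, ..., phi_{k-1,k-1}
    v = 1.0                # normalized one-step mean squared error
    for k in range(1, m + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        phi_kk = num / v
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        v *= 1.0 - phi_kk ** 2
        pacf.append(phi_kk)
    return pacf

def suggested_order(values, n):
    """Largest lag (counting from 1) whose value lies outside +-1.96/sqrt(n)."""
    bound = 1.96 / math.sqrt(n)
    outside = [h for h, v in enumerate(values, start=1) if abs(v) > bound]
    return max(outside) if outside else 0

# ACF of an AR(1) process with coefficient 0.5: the PACF cuts off after lag 1.
rho = [1.0, 0.5, 0.25, 0.125]
alpha = pacf_from_acf(rho)
order = suggested_order(alpha, 100)
```

Applied to the sample ACF at lags 1, 2, ... the same helper gives a candidate MA order; applied to the sample PACF it gives a candidate AR order. Since about 5% of the values are expected to fall outside the bounds by chance even for white noise, the result is only a starting point.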

If neither the sample ACF nor PACF “cuts off” as in the previous paragraph, a 
more refined model selection technique is required (see the discussion of the AICC 
statistic in Section 5.5.2). Even if the sample ACF or PACF does cut off at some lag, 
it is still advisable to explore models other than those suggested by the sample ACF 
and PACF values. 


Figure D.4 shows the sample ACF of the AIRPASS.TSM series after taking loga- 
rithms, differencing at lags 12 and 1, and subtracting the mean. Figure D.5 shows the 
corresponding sample PACF. These graphs suggest that we consider an MA model 
of order 12 (or perhaps 23) with a large number of zero coefficients, or alternatively 
an AR model of order 12. 
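
The transformations applied to the series in this example (logarithms, differencing at lags 12 and 1, and mean correction) can be sketched as follows. The input below is a synthetic stand-in of length 144, not the AIRPASS data, and the function names are illustrative.

```python
import math

def difference(x, lag):
    """Lag-d differences x[t] - x[t - lag]; the first `lag` values are lost."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

def log_diff_mean_correct(x, lags=(12, 1)):
    """Take logs, difference at each lag in turn, then subtract the mean."""
    y = [math.log(v) for v in x]
    for lag in lags:
        y = difference(y, lag)
    mean = sum(y) / len(y)
    return [v - mean for v in y]

# Synthetic monthly series: exponential growth times a period-12 seasonal factor.
x = [100.0 * 1.01 ** t * (1.0 + 0.1 * math.sin(2 * math.pi * t / 12))
     for t in range(144)]
y = log_diff_mean_correct(x)
print(len(y))   # 144 - 12 - 1 = 131
```

For this noise-free input the transformed values are zero up to rounding, because the logged trend is linear and the seasonal factor has period exactly 12.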


D.3.3 Entering a Model 


A major function of ITSM is to find an ARMA model whose properties reflect to 
a high degree those of an observed (and possibly transformed) time series. Any 
particular causal ARMA(p, q) model with $p \le 27$ and $q \le 27$ can be entered 
directly by choosing Model>Specify, entering the values of p, q, the coefficients, 
and the white noise variance, and clicking OK. If there is a data set already open in 
ITSM, a quick way of entering a reasonably appropriate model is to use the option 
Model>Estimation>Preliminary, which estimates the coefficients and white noise 
variance of an ARMA model after you have specified the orders p and q and selected 
one of the four preliminary estimation algorithms available. An optimal preliminary 
AR model can also be fitted by checking Find AR model with min AICC in the 
Preliminary Estimation dialog box. If no model is entered or estimated, ITSM 
assumes the default ARMA(0,0), or white noise, model 


$$X_t = Z_t,$$


where $\{Z_t\}$ is an uncorrelated sequence of random variables with mean zero and 
variance 1. 

If you have data and no particular ARMA model in mind, it is advisable to use 
the option Model>Estimation>Preliminary or equivalently to press the blue PRE 
button at the top of the ITSM window. 

Sometimes you may wish to try a model found in a previous session or a model 
suggested by someone else. In that case choose Model>Specify and enter the re- 
quired model. You can save both the model and data from any project by selecting 
File>Project>Save as and specifying the name for the new file. When the new 
file is opened, both the model and the data will be imported. To create a project with 
this model and a new data set select File>Import File and enter the name of the 
file containing the new data. (This file must contain data only. If it also contains a 
model, then the model will be imported with the data and the model previously in 
ITSM will be overwritten.) 


D.3.4 Preliminary Estimation 


The option Model>Estimation>Preliminary contains fast (but not the most effi- 
cient) model-fitting algorithms. They are useful for suggesting the most promising 
models for the data, but should be followed by maximum likelihood estimation using 
Model>Estimation>Max likelihood. The fitted preliminary model is generally 
used as an initial approximation with which to start the nonlinear optimization car- 
ried out in the course of maximizing the (Gaussian) likelihood. 

To fit an ARMA model of specified order, first enter the values of p and q (see Sec- 
tion 2.6.1). For pure AR models q = 0, and the preliminary estimation option offers a 
choice between the Burg and Yule-Walker estimates. (The Burg estimates frequently 
give higher values of the Gaussian likelihood than the Yule-Walker estimates.) If 
q = 0, you can also check the box Find AR model with min AICC to allow the 
program to fit AR models of orders 0, 1, ..., 27 and select the one with smallest AICC 
value (Section 5.5.2). For models with q > 0, ITSM provides a choice between two 
preliminary estimation methods, one based on the Hannan-Rissanen procedure and 
the other on the innovations algorithm. If you choose the innovations option, a default 
value of m will be displayed on the screen. This parameter was defined in Section 
5.1.3. The standard choice is the default value computed by ITSM. The Hannan- 
Rissanen algorithm is recommended when p and q are both greater than 0, since it 
tends to give causal models more frequently than the innovations method. The latter 
is recommended when p = 0. 
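
As an illustration of what a preliminary estimator computes, the following Python sketch obtains Yule-Walker estimates for an AR(p) model by applying the Durbin-Levinson recursion to the sample autocovariances. It is a textbook version written for this appendix, not ITSM's implementation, and the Burg, Hannan-Rissanen, and innovations methods are not shown. The name `yule_walker` is an assumption.

```python
def yule_walker(x, p):
    """Yule-Walker estimates (phi_1..phi_p, white noise variance) for AR(p)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    # Sample autocovariances with divisor n.
    gamma = [sum(d[t] * d[t + h] for t in range(n - h)) / n for h in range(p + 1)]
    phi = []
    v = gamma[0]                      # mean squared error of the order-0 predictor
    for k in range(1, p + 1):
        num = gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))
        phi_kk = num / v
        phi = [phi[j] - phi_kk * phi[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        v *= 1.0 - phi_kk ** 2
    return phi, v

phi_hat, sigma2_hat = yule_walker([1.0, 2.0, 3.0, 4.0, 5.0], 1)
# phi_hat is approximately [0.4] and sigma2_hat approximately 1.68
```

The recursion produces the fitted models of every order up to p along the way, which is what makes a search over AR orders for the smallest AICC value inexpensive.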

Once the required entries in the Preliminary Estimation dialog box have been 
completed, click OK, and ITSM will quickly estimate the parameters of the selected 
model and display a number of diagnostic statistics. (If p and q are both greater than 
0, it is possible that the fitted model may be noncausal, in which case ITSM sets 
all the coefficients to .001 to ensure the causality required for subsequent maximum 
likelihood estimation. It will also give you the option of fitting a model of different 
order.) 

Provided that the fitted model is causal, the estimated parameters are given with 
the ratio of each estimate to 1.96 times its standard error. The denominator (1.96 
× standard error) is the critical value (at level .05) for the coefficient. Thus, if the 
ratio is greater than 1 in absolute value, we may conclude (at level .05) that the 
corresponding coefficient is different from zero. On the other hand, a ratio less than 
1 in absolute value suggests the possibility that the corresponding coefficient in the 
model may be zero. (If the innovations option is chosen, the ratios of estimates to 
1.96 × standard error are displayed only when p = q or p = 0.) In the Preliminary 
Estimates window you will also see one or more estimates of the white noise variance 
(the residual sum of squares divided by the sample size is the estimate retained by 
ITSM) and some further diagnostic statistics. These are $-2\ln L(\hat\phi, \hat\theta, \hat\sigma^2)$, where $L$ 
denotes the Gaussian likelihood (5.2.9), and the AICC statistic 


$$-2\ln L + 2(p+q+1)n/(n-p-q-2)$$


(see Section 5.5.2). 

Our eventual aim is to find a model with as small an AICC value as possible. 
Smallness of the AICC value computed in the preliminary estimation phase is in- 
dicative of a good model, but should be used only as a rough guide. Final decisions 
between models should be based on maximum likelihood estimation, carried out using 
the option Model>Estimation>Max likelihood, since for fixed p and q, the 
values of $\phi$, $\theta$, and $\sigma^2$ that minimize the AICC statistic are the maximum likelihood 
estimates, not the preliminary estimates. After completing preliminary estimation, 
ITSM stores the estimated model coefficients and white noise variance. The stored 
estimate of the white noise variance is the sum of squares of the residuals (or one-step 
prediction errors) divided by the number of observations. 

A variety of models should be explored using the preliminary estimation algo- 
rithms, with a view to finding the most likely candidates for minimizing AICC when 
the parameters are reestimated by maximum likelihood. 


To find the minimum-AICC Burg AR model for the logged, differenced, and mean- 
corrected series AIRPASS.TSM currently stored in ITSM, press the blue PRE button, 
set the MA order equal to zero, select Burg and Find AR model with min AICC, 
and then click OK. The minimum-AICC AR model is of order 12 with an AICC value 
of -458.13. To fit a preliminary MA(25) model to the same data, press the blue 
PRE button again, but this time set the AR order to 0, the MA order to 25, select 
Innovations, and click OK. 

The ratios (estimated coefficient)/(1.96 × standard error) indicate that the coefficients 
at lags 1 and 12 are nonzero, as suggested by the sample ACF. The estimated 
coefficients at lags 3 and 23 also look substantial even though the corresponding 
ratios are less than 1 in absolute value. The displayed values are as follows: 


MA COEFFICIENTS 

  -.3568    .0673   -.1629   -.0415    .1268 
   .0264    .0283   -.0648    .1326   -.0762 
  -.0066   -.4987    .1789   -.0318    .1476 
  -.1461    .0440   -.0226   -.0749   -.0456 
  -.0204   -.0085    .2014   -.0767   -.0789 

RATIO OF COEFFICIENTS TO (1.96*STANDARD ERROR) 

 -2.0833    .3703   -.8941   -.2251    .6875 
   .1423    .1522   -.3487    .7124   -.4061 
  -.0353  -2.6529    .8623   -.1522    .7068 
  -.6944    .2076   -.1065   -.3532   -.2147 
  -.0960   -.0402    .9475   -.3563   -.3659 


The estimated white noise variance is .00115 and the AICC value is -440.93, which 
is not as good as that of the AR(12) model. Later we shall find a subset MA(25) 
model that has a smaller AICC value than both of these models. 
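
The screening rule based on these ratios is easy to apply mechanically. The Python sketch below uses the 25 ratios displayed above, read row by row as lags 1 to 25, and lists the lags whose ratio exceeds 1 in absolute value; the variable names are illustrative.

```python
# Ratios (estimated coefficient)/(1.96 * standard error) for lags 1..25.
ratios = [
    -2.0833,  0.3703, -0.8941, -0.2251,  0.6875,
     0.1423,  0.1522, -0.3487,  0.7124, -0.4061,
    -0.0353, -2.6529,  0.8623, -0.1522,  0.7068,
    -0.6944,  0.2076, -0.1065, -0.3532, -0.2147,
    -0.0960, -0.0402,  0.9475, -0.3563, -0.3659,
]

# A ratio above 1 in absolute value is significant at level .05.
significant_lags = [lag for lag, r in enumerate(ratios, start=1) if abs(r) > 1.0]
print(significant_lags)   # [1, 12]
```

The rule is a guide rather than a verdict: the ratios at lags 3 and 23 fall just short of 1, which is why those coefficients are also retained in the subset model considered later.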


D.3.5 The AICC Statistic 


The AICC statistic for the model with parameters $p$, $q$, $\phi$, and $\theta$ is defined (see 
Section 5.2.2) as 


$$\mathrm{AICC}(\phi, \theta) = -2\ln L(\phi, \theta, S(\phi, \theta)/n) + 2(p + q + 1)n/(n - p - q - 2),$$


and a model chosen according to the AICC criterion minimizes this statistic. 

Model-selection statistics other than AICC are also available in ITSM. A Bayesian 
modification of the AIC statistic known as the BIC statistic is also computed in the 
option Model>Estimation>Max likelihood. It is used in the same way as the 
AICC. 

An exhaustive search for a model with minimum AICC or BIC value can be 
very slow. For this reason the sample ACF and PACF and the preliminary estimation 
techniques described above are useful in narrowing down the range of models to 
be considered more carefully in the maximum likelihood estimation stage of model 
fitting. 


D.3.6 Changing Your Model 


The model currently stored by the program can be checked at any time by selecting 
Model>Specify. Any parameter can be changed in the resulting dialog box, including 
the white noise variance. The model can be filed together with the data for later use 
by selecting File>Project>Save as and specifying a file name with suffix .TSM. 

Example D.3.4. 


We shall now set some of the coefficients in the current model to zero. To do this choose 
Model>Specify and click on the box containing the value -.35676 of Theta(1). Press 
Enter, and the value of Theta(2) will appear in the box. Set this to zero. Press Enter 
again, and the value of Theta(3) will appear. Continue to work through the coefficients, 
setting all except Theta(1), Theta(3), Theta(12), and Theta(23) equal to zero. When 
you have reset the parameters, click OK, and the new model stored in ITSM will be 
the subset MA(23) model 


$$X_t = Z_t - .357Z_{t-1} - .163Z_{t-3} - .499Z_{t-12} + .201Z_{t-23},$$


where $\{Z_t\} \sim \mathrm{WN}(0, .00115)$. 


D.3.7 Maximum Likelihood Estimation 


Once you have specified values of p and q and possibly set some coefficients to zero, 
you can carry out efficient parameter estimation by selecting Model>Estimation> 
Max likelihood or equivalently by pressing the blue MLE button. 

The resulting dialog box displays the default settings, which in most cases will 
not need to be modified. However, if you wish to compute the likelihood without 
maximizing it, check the box labeled No optimization. The remaining information 
concerns the optimization settings. (With the default settings, any coefficients that 
are set to zero will be treated as fixed values and not as parameters. Coefficients to 
be optimized must therefore not be set exactly to zero. If you wish to impose further 
constraints on the optimization, press the Constrain optimization button. This 
allows you to fix certain coefficients or to impose multiplicative relationships on the 
coefficients during optimization.) 

To find the maximum likelihood estimates of your parameters, click OK, and the 
estimated parameters will be displayed. To refine the estimates, repeat the estimation, 
specifying a smaller value of the accuracy parameter in the Maximum Likelihood 
dialog box. 


To find the maximum likelihood estimates of the parameters in the model for the 
logged, differenced, and mean-corrected airline passenger data currently stored in 
ITSM, press the blue MLE button and click OK. The following estimated parameters 
and diagnostic statistics will then be displayed: 


ARMA MODEL: 
X(t) = Z(t) + (-.355)*Z(t-1) + (-.201)*Z(t-3) + (-.523)*Z(t-12) + (.242)*Z(t-23) 

WN Variance = .001250 

MA Coefficients 
THETA( 1)= -.355078    THETA( 3)= -.201125 
THETA(12)= -.523423    THETA(23)=  .241527 

Standard Error of MA Coefficients 
THETA( 1): .059385    THETA( 3): .059297 
THETA(12): .058011    THETA(23): .055828 

(Residual SS)/N = .125024E-02 
AICC = -.486037E+03 
BIC = -.487622E+03 
-2 Ln(Likelihood)= -.496517E+03 
Accuracy parameter = .00205000 
Number of iterations = 5 
Number of function evaluations = 46 
Optimization stopped within accuracy level. 


The last message indicates that the minimum of $-2\ln L$ has been located with 
the specified accuracy. If you see the message 

Iteration limit exceeded, 

then the minimum of $-2\ln L$ could not be located with the number of iterations (50) 
allowed. You can continue the search (starting from the point at which the iterations 
were interrupted) by pressing the MLE button to continue the minimization and 
possibly increasing the maximum number of iterations from 50 to 100. 
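
The AICC value in the output above can be checked against the definition in Section D.3.5. The sketch below makes two assumptions that this section does not state explicitly: the transformed series has n = 131 observations (144 less the 13 lost in differencing at lags 12 and 1), and for a subset model only the four coefficients that are actually estimated count toward q. The function name `aicc` is illustrative.

```python
def aicc(minus2_log_lik, p, q, n):
    """AICC = -2 ln L + 2(p + q + 1) n / (n - p - q - 2)."""
    return minus2_log_lik + 2.0 * (p + q + 1) * n / (n - p - q - 2)

# -2 ln L = -496.517 from the output; p = 0, q = 4 free coefficients, n = 131 (assumed).
value = aicc(-496.517, p=0, q=4, n=131)
print(round(value, 3))   # -486.037, matching the displayed AICC
```

The penalty term here is 2(5)(131)/125 = 10.48, so under these assumptions the definition reproduces the displayed value.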


D.3.8 Optimization Results 


After maximizing the Gaussian likelihood, ITSM displays the model parameters 
(coefficients and white noise variance), the values of $-2\ln L$, AICC, BIC, and 
information regarding the computations. 


The next stage of the analysis is to consider a variety of competing models and to 
select the most suitable. The following table shows the AICC statistics for a variety 
of subset moving average models of order less than 24. 


Lags               AICC 
1 3 12 23          -486.04 
1 3 12 13 23       -485.78 
1 3 5 12 23        -489.95 
1 3 12 13          -482.62 
1 12               -475.91 


The best of these models from the point of view of AICC value is the one with 
nonzero coefficients at lags 1, 3, 5, 12, and 23. To obtain this model from the one 
currently stored in ITSM, select Model>Specify, change the value of THETA(5) 
from zero to .001, and click OK. Then reoptimize by pressing the blue MLE button 
and clicking OK. You should obtain the noninvertible model 


$$X_t = Z_t - .434Z_{t-1} - .305Z_{t-3} + .238Z_{t-5} - .656Z_{t-12} + .351Z_{t-23},$$


where $\{Z_t\} \sim \mathrm{WN}(0, .00103)$. For future reference, file the model and data as 
AIRPASS2.TSM using the option File>Project>Save as. 


The next step is to check our model for goodness of fit. 


D.4 Testing Your Model 


Once we have a model, it is important to check whether it is any good or not. Typi- 
cally this is judged by comparing observations with corresponding predicted values 
obtained from the fitted model. If the fitted model is appropriate then the prediction 
errors should behave in a manner that is consistent with the model. The residuals are 
the rescaled one-step prediction errors, 


$$\hat W_t = (X_t - \hat X_t)/\sqrt{r_{t-1}},$$


where $\hat X_t$ is the best linear mean-square predictor of $X_t$ based on the observations up 
to time $t-1$, $r_{t-1} = E(X_t - \hat X_t)^2/\sigma^2$, and $\sigma^2$ is the white noise variance of the fitted 
model. 

If the data were truly generated by the fitted ARMA(p, q) model with white noise 
sequence $\{Z_t\}$, then for large samples the properties of $\{\hat W_t\}$ should reflect those of 
$\{Z_t\}$. To check the appropriateness of the model we therefore examine the residual 
series $\{\hat W_t\}$, and check that it resembles a realization of a white noise sequence. 
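
For a causal AR(1) model the one-step predictors and their mean squared errors have closed forms, which makes the residual definition easy to illustrate. The sketch below is restricted to that special case by assumption; general ARMA models require a recursive prediction algorithm, which is not shown. The name `ar1_residuals` is illustrative.

```python
import math

def ar1_residuals(x, phi):
    """Rescaled one-step prediction errors for a zero-mean causal AR(1), |phi| < 1."""
    # The predictor of X_1 is 0, with r_0 = E(X_1^2)/sigma^2 = 1/(1 - phi^2).
    r0 = 1.0 / (1.0 - phi ** 2)
    w = [x[0] / math.sqrt(r0)]
    # For t >= 2 the predictor is phi * X_{t-1} and r_{t-1} = 1.
    w += [x[t] - phi * x[t - 1] for t in range(1, len(x))]
    return w

w = ar1_residuals([1.0, 0.5, 0.25], phi=0.5)
# The data follow x_t = 0.5 * x_{t-1} exactly, so every residual after the first is 0.
```

If the model is appropriate, the values returned should look like white noise with variance close to the fitted white noise variance.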

ITSM provides a number of tests for doing this in the Residuals Menu, which 
is obtained by selecting the option Statistics>Residual Analysis. Within this 
option are the suboptions 


Plot 
QQ-Plot (normal) 
QQ-Plot (t-distr) 
Histogram 
ACF/PACF 
ACF Abs vals/Squares 
Tests of randomness 


D.4.1 Plotting the Residuals 


Select Statistics>Residual Analysis>Histogram, and you will see a histogram 
of the rescaled residuals, defined as 


$$\hat R_t = \hat W_t/\hat\sigma,$$


where $n\hat\sigma^2$ is the sum of the squared residuals. If the fitted model is appropriate, the 
histogram of the rescaled residuals should have mean close to zero. If in addition the 
data are Gaussian, this will be reflected in the shape of the histogram, which should 
then resemble a normal density with mean zero and variance 1. 

Select Statistics>Residual Analysis>Plot and you will see a graph of $\hat R_t$ 
vs. $t$. If the fitted model is appropriate, this should resemble a realization of a white 
noise sequence. Look for trends, cycles, and nonconstant variance, any of which 
suggest that the fitted model is inappropriate. If substantially more than 5% of the 
rescaled residuals lie outside the bounds $\pm 1.96$ or if there are rescaled residuals far 
outside these bounds, then the fitted model should not be regarded as Gaussian. 

Compatibility of the distribution of the residuals with either the normal distribution 
or the t-distribution can be checked by inspecting the corresponding qq plots and 
checking for approximate linearity. To test for normality, the Jarque-Bera statistic is 
also computed. 
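
The checks described in this section can be sketched in Python as follows. The rescaling follows the definition given above; the Jarque-Bera statistic is written in its standard form based on sample skewness and kurtosis, and it is an assumption that ITSM computes it in exactly this way. All function names are illustrative.

```python
import math

def rescale(w):
    """Rescaled residuals W_t / sigma_hat, where n * sigma_hat^2 = sum of W_t^2."""
    sigma_hat = math.sqrt(sum(v * v for v in w) / len(w))
    return [v / sigma_hat for v in w]

def fraction_outside(r, bound=1.96):
    """Proportion of rescaled residuals outside +-bound."""
    return sum(1 for v in r if abs(v) > bound) / len(r)

def jarque_bera(y):
    """n * (m3^2 / (6 m2^3) + (m4 / m2^2 - 3)^2 / 24), with central sample moments."""
    n = len(y)
    mean = sum(y) / n
    m2 = sum((v - mean) ** 2 for v in y) / n
    m3 = sum((v - mean) ** 3 for v in y) / n
    m4 = sum((v - mean) ** 4 for v in y) / n
    return n * (m3 ** 2 / (6.0 * m2 ** 3) + (m4 / m2 ** 2 - 3.0) ** 2 / 24.0)

r = rescale([1.0, -1.0, 1.0, -1.0])
share = fraction_outside(r)
jb = jarque_bera(r)
```

Under normality the Jarque-Bera statistic is approximately chi-squared with 2 degrees of freedom, so large values cast doubt on the Gaussian assumption.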


The histogram of the rescaled residuals from our model for the logged, differenced, 
and mean-corrected airline passenger series is shown in Figure D.6. The mean is 
close to zero, and the shape suggests that the assumption of Gaussian white noise is 
not unreasonable in our proposed model. 

The graph of $\hat R_t$ vs. $t$ is shown in Figure D.7. A few of the rescaled residuals 
are greater in magnitude than 1.96 (as is to be expected), but there are no obvious 
indications here that the model is inappropriate. The approximate linearity of the 
normal qq plot and the Jarque-Bera test confirm the approximate normality of the 
residuals. 


D.4.2 ACF/PACF of the Residuals 


If we were to assume that our fitted model is the true process generating the data, 
then the observed residuals would be realized values of a white noise sequence. 

In particular, the sample ACF and PACF of the observed residuals should lie 
within the bounds $\pm 1.96/\sqrt{n}$ roughly 95% of the time. These bounds are displayed 
on the graphs of the ACF and PACF. If substantially more than 5% of the correlations 
are outside these limits, or if there are a few very large values, then we should look 
for a better-fitting model. (More precise bounds, due to Box and Pierce, can be found 
in TSTM Section 9.4.) 

[Figure D.7: Time plot of the rescaled residuals from AIRPASS.MOD (1 unit = 1 standard deviation).] 

Choose Statistics>Residual Analysis>ACF/PACF, or equivalently press the 
middle green button at the top of the ITSM window. The sample ACF and PACF 
of the residuals will then appear as shown in Figures D.8 and D.9. No correlations 
are outside the bounds in this case. They appear to be compatible with the hypothesis 
that the residuals are in fact observations of a white noise sequence. To check for 
independence of the residuals, the sample autocorrelation functions of their absolute 
values and squares can be plotted by clicking on the third green button. 

[Figure D.8: Sample ACF of the residuals from AIRPASS.MOD.] 

[Figure D.9: Sample PACF of the residuals from AIRPASS.MOD.] 


D.4.3 Testing for Randomness of the Residuals 


The option Statistics>Residual Analysis>Tests of Randomness carries out 
the six tests for randomness of the residuals described in Section 5.3.3. 


The residuals from our model for the logged, differenced, and mean-corrected series 
AIRPASS.TSM are checked by selecting the option indicated above and selecting 
the parameter h for the portmanteau tests. Adopting the value h = 25 suggested by 
ITSM, we obtain the following results: 


RANDOMNESS TEST STATISTICS (see Section 5.3.3) 


LJUNG-BOX PORTM.= 13.76 CHISQUR( 20), p-value = 0.843 
MCLEOD-LI PORTM.= 17.39 CHISQUR( 25), p-value = 0.867 
TURNING POINTS = 87. ANORMAL( 86.00, 4.79**2), p-value = 0.835 
DIFFERENCE-SIGN = 65. ANORMAL( 65.00, 3.32**2), p-value = 1.000 
RANK TEST = 3934. ANORMAL(4257.50, 251.3**2), p-value = 0.198 
JARQUE-BERA = 4.33 CHISQUR(2) p-value = 0.115 
ORDER OF MIN AICC YW MODEL FOR RESIDUALS = 0 


Every test is easily passed by our fitted model (with significance level $\alpha = .05$), 
and the order of the minimum-AICC AR model for the residuals supports the 
compatibility of the residuals with white noise. For later use, file the residuals by 
pressing the red EXP button and exporting the residuals to a file with the name 
AIRRES.TSM. 
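
Three of the lines in the output above can be reproduced from the usual large-sample null distributions of the turning point, difference-sign, and rank statistics. The sketch below assumes n = 131, the value implied by the displayed means (for instance 2(n - 2)/3 = 86), and computes two-sided p-values from the normal approximation. The function names are illustrative and this is not ITSM's code.

```python
import math

def two_sided_p(stat, mean, sd):
    """Two-sided p-value under a normal approximation."""
    z = (stat - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))

def turning_point_null(n):
    return 2.0 * (n - 2) / 3.0, math.sqrt((16.0 * n - 29.0) / 90.0)

def difference_sign_null(n):
    return (n - 1) / 2.0, math.sqrt((n + 1) / 12.0)

def rank_null(n):
    return n * (n - 1) / 4.0, math.sqrt(n * (n - 1) * (2.0 * n + 5.0) / 72.0)

n = 131  # assumed: inferred from the means printed in the output
p_turning = two_sided_p(87, *turning_point_null(n))   # about 0.835
p_diff = two_sided_p(65, *difference_sign_null(n))    # 1.0
p_rank = two_sided_p(3934, *rank_null(n))             # about 0.198
```

The means and standard deviations returned for n = 131 are (86.00, 4.79), (65.00, 3.32), and (4257.50, 251.3), in agreement with the ANORMAL entries in the listing.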


D.5 Prediction 


One of the main purposes of time series modeling is the prediction of future ob- 
servations. Once you have found a suitable model for your data, you can predict 
future values using the option Forecasting>ARMA. (The other options listed under 
Forecasting refer to the methods of Chapter 9.) 


D.5.1 Forecast Criteria 


Given observations $X_1, \ldots, X_n$ of a series that we assume to be appropriately modeled 
as an ARMA(p, q) process, ITSM predicts future values of the series $X_{n+h}$ from the 
data and the model by computing the linear combination $P_n(X_{n+h})$ of $X_1, \ldots, X_n$ that 
minimizes the mean squared error $E(X_{n+h} - P_n(X_{n+h}))^2$. 


D.5.2 Forecast Results 


Assuming that the current data set has been adequately fitted by the current 
ARMA(p, q) model, choose Forecasting>ARMA, and you will see the ARMA Forecast 
dialog box. You will be asked for the number of forecasts required, which of the 
transformations you wish to invert (the default settings are to invert all of them so as to obtain 
forecasts of the original data), whether or not you wish to plot prediction bounds 
(assuming normality), and if so, the confidence level required, e.g., 95%. After pro- 
viding this information, click OK, and the data will be plotted with the forecasts (and 
possibly prediction bounds) appended. As is to be expected, the separation of the 
prediction bounds increases with the lead time h of the forecast. 
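
The widening of the prediction bounds with lead time can be seen explicitly in the simplest case. The sketch below assumes a zero-mean causal AR(1) model, chosen only because its h-step predictor and mean squared error have closed forms: the predictor is phi^h X_n and the mean squared error is sigma^2 (1 - phi^(2h))/(1 - phi^2). The function name `ar1_forecast` is illustrative.

```python
import math

def ar1_forecast(x_n, phi, sigma2, h_max, z=1.96):
    """(h, predictor, lower, upper) for h = 1..h_max under a Gaussian AR(1) model."""
    out = []
    for h in range(1, h_max + 1):
        pred = phi ** h * x_n
        mse = sigma2 * (1.0 - phi ** (2 * h)) / (1.0 - phi ** 2)
        half = z * math.sqrt(mse)
        out.append((h, pred, pred - half, pred + half))
    return out

fc = ar1_forecast(x_n=2.0, phi=0.5, sigma2=1.0, h_max=3)
for h, pred, lo, hi in fc:
    print(h, round(pred, 4), round(hi - lo, 4))
```

The predictors decay toward the mean of the process while the width of the bounds grows toward its limit, 2 × 1.96 times the standard deviation of the process.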

Right-click on the graph, select Info, and the numerical values of the predictors 
and prediction bounds will be printed. 


We left our logged, differenced, and mean-corrected airline passenger data stored in 
ITSM with the subset MA(23) model found in Example D.3.5. To predict the next 
24 values of the original series, select Forecasting>ARMA and accept the default 
settings in the dialog box by clicking OK. You will then see the graph shown in Figure 
D.10. Numerical values of the forecasts are obtained by right-clicking on the graph 
and selecting Info. The ARMA Forecast dialog box also permits using a model 
constructed from a subset of the data to obtain forecasts and prediction bounds for 
the remaining observed values of the series. 


[Figure D.10: The original AIRPASS data with 24 forecasts appended.] 


D.6 Model Properties 


ITSM can be used to analyze the properties of a specified ARMA process without 
reference to any data set. This enables us to explore and compare the properties 
of different ARMA models in order to gain insight into which models might best 
represent particular features of a given data set. 

For any ARMA(p, q) process or fractionally integrated ARMA(p, q) process 
with $p \le 27$ and $q \le 27$, ITSM allows you to compute the autocorrelation and 
partial autocorrelation functions, the spectral density and distribution functions, and 
the MA($\infty$) and AR($\infty$) representations of the process. It also allows you to generate 
simulated realizations of the process driven by either Gaussian or non-Gaussian noise. 
The use of these options is described in this section. 


We shall illustrate the use of ITSM for model analysis using the model for the trans- 
formed series AIRPASS.TSM that is currently stored in the program. 


D.6.1 ARMA Models 


For modeling zero-mean stationary time series, ITSM uses the class of ARMA (and 
fractionally integrated ARMA) processes. ITSM enables you to compute characteristics 
of the causal ARMA model defined by 


$$X_t - \phi_1 X_{t-1} - \phi_2 X_{t-2} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2} + \cdots + \theta_q Z_{t-q},$$


or more concisely $\phi(B)X_t = \theta(B)Z_t$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and the parameters 
are all specified. (Characteristics of the fractionally integrated ARIMA(p, d, q) 
process defined by 


$$(1 - B)^d \phi(B)X_t = \theta(B)Z_t, \quad |d| < 0.5,$$


can also be computed.) 

ITSM works exclusively with causal models. It will not permit you to enter a 
model for which $1 - \phi_1 z - \cdots - \phi_p z^p$ has a zero inside or on the unit circle, nor does 
it generate fitted models with this property. From the point of view of second-order 
properties, this represents no loss of generality (Section 3.1). If you are trying to 
enter an ARMA(p, q) model manually, the simplest way to ensure that your model 
is causal is to set all the autoregressive coefficients close to zero (e.g., .001). ITSM 
will not accept a noncausal model. 
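
How ITSM tests for causality is not described here. One standard way to check the condition without finding the zeros explicitly is the step-down (backward Durbin-Levinson) recursion, sketched below in Python: the autoregressive polynomial has all its zeros outside the unit circle exactly when every reflection coefficient produced by the recursion is less than 1 in absolute value. The function name `is_causal` is illustrative.

```python
def is_causal(phi):
    """True if 1 - phi_1 z - ... - phi_p z^p has no zero inside or on the unit circle."""
    a = list(phi)
    while a:
        k = a[-1]                      # reflection coefficient at the current order
        if abs(k) >= 1.0:
            return False
        m = len(a)
        # Step down from order m to order m - 1.
        a = [(a[j] + k * a[m - 2 - j]) / (1.0 - k * k) for j in range(m - 1)]
    return True

print(is_causal([0.5, 0.3]))   # True
print(is_causal([0.5, 0.6]))   # False, since phi_1 + phi_2 > 1
```

The same recursion applied to the coefficients -theta_1, ..., -theta_q checks whether a moving-average polynomial 1 + theta_1 z + ... + theta_q z^q has all its zeros outside the unit circle.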

ITSM does not restrict models to be invertible. You can check whether or not 
the current model is invertible by choosing Model>Specify and pressing the button 
labeled Causal/Invertible in the resulting dialog box. If the model is noninvertible, 
i.e., if the moving-average polynomial $1 + \theta_1 z + \cdots + \theta_q z^q$ has a zero inside or on 
the unit circle, the message Non-invertible will appear beneath the box contain- 
ing the moving-average coefficients. (A noninvertible model can be converted to an 
invertible model with the same autocovariance function by choosing Model>Switch 
to invertible. If the model is already invertible, the program will tell you.) 

[Figure D.11: The ACF of the model in Example D.3.5 together with the sample ACF of the transformed AIRPASS.TSM series.] 


D.6.2 Model ACF, PACF 


The model ACF and PACF are plotted using Model>ACF/PACF>Model. If you wish 
to change the maximum lag from the default value of 40, select Model>ACF/PACF> 
Specify Lag and enter the required maximum lag. (It can be much larger than 40, 
e.g., 10000). The graph will then be modified, showing the correlations up to the 
specified maximum lag. 

If there is a data file open as well as a model in ITSM, the model ACF and PACF 
can be compared with the sample ACF and PACF by pressing the third yellow button 
at the top of the ITSM window. The model correlations will then be plotted in red, 
with the corresponding sample correlations shown in the same graph but plotted in 
green. 


The sample and model ACF and PACF for the current model and transformed series 
AIRPASS.TSM are shown in Figures D.11 and D.12. They are obtained by pressing 
the third yellow button at the top of the ITSM window. The vertical lines represent the 
model values, and the squares are the sample ACF/PACF. The graphs show that the 
data and the model ACF both have large values at lag 12, while the sample and model 
partial autocorrelation functions both tend to die away geometrically after the peak at 
lag 12. The similarities between the graphs indicate that the model is capturing some 
of the important features of the data. 

[Figure D.12: The PACF of the model in Example D.3.5 together with the sample PACF of the transformed AIRPASS.TSM series.] 


D.6.3 Model Representations 


As indicated in Section 3.1, if $\{X_t\}$ is a causal ARMA process, then it has an MA($\infty$) 
representation 


$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j},$$


where $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $\psi_0 = 1$. 

Similarly, if $\{X_t\}$ is an invertible ARMA process, then it has an AR($\infty$) 
representation 


$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j},$$


where $\sum_{j=0}^{\infty} |\pi_j| < \infty$ and $\pi_0 = 1$. 

For any specified causal ARMA model you can determine the coefficients in these 
representations by selecting the option Model>AR/MA Infinity. (If the model is not 
invertible, you will see only the MA($\infty$) coefficients, since the AR($\infty$) representation 
does not exist in this case.) 


The current subset MA(23) model for the transformed series AIRPASS.TSM does 
not have an AR($\infty$) representation, since it is not invertible. However, we can replace 
the model with an invertible one having the same autocovariance function by selecting 
Model>Switch to Invertible. For this model we can then find an AR($\infty$) 
representation by selecting Model>AR Infinity. This gives 50 coefficients, the first 
20 of which are shown below. 


        MA-Infinity    AR-Infinity 

  j     psi(j)         pi(j) 
  0     1.00000        1.00000 
  1     -.36251         .36251 
  2      .01163         .11978 
  3     -.26346         .30267 
  4     -.06924         .27307 
  5      .15484        -.00272 
  6     -.02380         .05155 
  7     -.06557         .16727 
  8     -.04487         .10285 
  9      .01921         .01856 
 10     -.00113         .07947 
 11      .01882         .07000 
 12     -.57008         .58144 
 13      .00617         .41683 
 14      .00695         .23490 
 15      .03188         .37200 
 16      .02778         .38961 
 17      .01417         .10918 
 18      .02502         .08776 
 19      .00958         .22791 

D.6.4 Generating Realizations of a Random Series 


ITSM can be used to generate realizations of a random time series defined by the 
currently stored model. 

To generate such a realization, select the option Model>Simulate, and you will 
see the ARMA Simulation dialog box. You will be asked to specify the number of 
observations required, the white noise variance (if you wish to change it from the cur- 
rent value), and an integer-valued random number seed (by specifying and recording 
this integer with up to nine digits you can reproduce the same realization at a later 
date by reentering the same seed). You will also have the opportunity to add a spec- 
ified mean to the simulated ARMA values. If the current model has been fitted to 
transformed data, then you can also choose to apply the inverse transformations to the 
simulated ARMA to generate a simulated version of the original series. The default 
distribution for the white noise is Gaussian. However, by pressing the button Change 
noise distribution you can select from a variety of alternative distributions or 
by checking the box Use Garch model for noise process you can generate an 
ARMA process driven by GARCH noise. Finally, you can choose whether the simulated 
data will overwrite the data set in the current project or whether they will be 
used to create a new project. Once you are satisfied with your choices, click OK, and 
the simulated series will be generated. 
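
The role of the random number seed can be illustrated with a small simulator. The Python sketch below generates a Gaussian ARMA realization and discards an initial burn-in stretch so that the retained values are approximately stationary; the burn-in device and the function name `simulate_arma` are assumptions made for this illustration. Because ITSM uses its own random number generator, a seed entered here will not reproduce a realization generated by ITSM.

```python
import random

def simulate_arma(phi, theta, sigma2, n, seed, burn_in=200):
    """Simulate n values of a zero-mean Gaussian ARMA(p, q) process."""
    rng = random.Random(seed)          # same seed, same realization
    p, q = len(phi), len(theta)
    sd = sigma2 ** 0.5
    total = n + burn_in
    z = [rng.gauss(0.0, sd) for _ in range(total)]
    x = []
    for t in range(total):
        val = z[t]
        val += sum(theta[k] * z[t - 1 - k] for k in range(q) if t - 1 - k >= 0)
        val += sum(phi[k] * x[t - 1 - k] for k in range(p) if t - 1 - k >= 0)
        x.append(val)
    return x[burn_in:]

a = simulate_arma([0.5], [0.4], 1.0, n=50, seed=12345)
b = simulate_arma([0.5], [0.4], 1.0, n=50, seed=12345)
print(a == b)   # True: re-entering the seed reproduces the series
```

Adding a mean or inverting transformations, as offered in the dialog box, would be applied to the returned values afterwards.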


Example D.6.4. 

To generate a simulated realization of the series AIRPASS.TSM using the current 
model and transformed data set, select the option Model>Simulate. The default 
options in the dialog box are such as to generate a realization of the original series 
as a new project, so it suffices to click OK. You will then see a graph of the simulated 
series that should resemble the original series AIRPASS.TSM. 


D.6.5 Spectral Properties 


Spectral properties of both data and fitted ARMA models can also be computed and 
plotted with the aid of ITSM. The spectral density of the model is determined by 
selecting the option Spectrum>Model. Estimation of the spectral density from ob- 
servations of a stationary series can be carried out in two ways, either by fitting an 
ARMA model as already described and computing the spectral density of the fitted 
model (Section 4.4) or by computing the periodogram of the data and smoothing (Sec- 
tion 4.2). The latter method is applied by selecting the option Spectrum>Smoothed 
Periodogram. Examples of both approaches are given in Chapter 4. 
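
The two approaches can also be sketched in code. The Python functions below are illustrations only and are not the ITSM implementation; their names, the equal (Daniell-type) weights, and the removal of the sample mean are assumptions. The first evaluates the standard ARMA spectral density f(lambda) = sigma^2 |theta(e^{-i lambda})|^2 / (2 pi |phi(e^{-i lambda})|^2); the second computes the periodogram at the Fourier frequencies and smooths it with a moving average of 2m+1 ordinates.

```python
import numpy as np

def arma_spectral_density(phi, theta, sigma2, lam):
    """Spectral density of a causal ARMA model at the frequencies lam."""
    lam = np.asarray(lam, dtype=float)
    z = np.exp(-1j * lam)
    ar = np.ones_like(z)                  # phi(z) = 1 - phi_1 z - ... - phi_p z^p
    for k, coef in enumerate(phi, start=1):
        ar = ar - coef * z ** k
    ma = np.ones_like(z)                  # theta(z) = 1 + theta_1 z + ... + theta_q z^q
    for k, coef in enumerate(theta, start=1):
        ma = ma + coef * z ** k
    return sigma2 / (2 * np.pi) * np.abs(ma) ** 2 / np.abs(ar) ** 2

def smoothed_periodogram(x, m=2):
    """Estimate the spectral density by averaging 2m+1 periodogram ordinates.

    Returns the Fourier frequencies in [0, pi] and the estimates there.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()                      # assumption: work with mean-corrected data
    periodogram = np.abs(np.fft.fft(x)) ** 2 / n   # I_n at 2*pi*k/n, k = 0..n-1
    # The periodogram is periodic in frequency, so smooth circularly.
    smooth = sum(np.roll(periodogram, j) for j in range(-m, m + 1)) / (2 * m + 1)
    k = np.arange(n // 2 + 1)
    return 2 * np.pi * k / n, smooth[k] / (2 * np.pi)

# Example: AR(1) model density on a grid, and an estimate from simulated noise.
grid = np.linspace(0.0, np.pi, 5)
f_model = arma_spectral_density([0.5], [], 1.0, grid)
freqs, f_hat = smoothed_periodogram(np.random.default_rng(1).normal(size=512), m=4)
```

A larger m gives a smoother but more biased estimate, which is the usual trade-off when choosing the weights.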


D.7 Multivariate Time Series 


Observations {x_1, ..., x_n} of an m-component time series must be stored as an ASCII 
file with n rows and m columns, with at least one space between entries in the same 
row. To open a multivariate series for analysis, select File>Project>Open>Multi- 
variate and click OK. Then double-click on the file containing the data, and you will 
be asked to enter the number of columns (m) in the data file. After doing this, click 
OK, and you will see graphs of each component of the series, with the multivariate 
tool bar at the top of the ITSM screen. For examples of the application of ITSM to 
the analysis of multivariate series, see Chapter 7. 
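
The file layout described above (n rows, m columns, entries separated by at least one space) is easy to read with general-purpose tools as well. The following Python sketch is an illustration only; the function name read_multivariate_series and the sample values are hypothetical, and the in-memory text stands in for a data file on disk.

```python
import io
import numpy as np

def read_multivariate_series(source, m):
    """Read an n-by-m array from whitespace-separated ASCII text.

    source may be a file name or an open text stream; m is the number of
    columns the caller expects, mirroring the value entered in ITSM.
    """
    data = np.loadtxt(source, ndmin=2)    # one row per observation time
    if data.shape[1] != m:
        raise ValueError(f"expected {m} columns, found {data.shape[1]}")
    return data

# Example with made-up values: three observations of a bivariate series.
text = "1.0 2.0\n3.0  4.0\n5.0 6.0\n"
x = read_multivariate_series(io.StringIO(text), m=2)
first_component = x[:, 0]                 # column j holds component j+1
```

Checking the column count on input guards against the most common mistake, which is entering a value of m that does not match the file.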
(DJAO2.TSM, DJAOPC2.TSM) 
225-226, 248, 251 
Durbin-Levinson algorithm, 69, 142 


E 


EM algorithm, 289-292 

Monte Carlo (MCEM), 298 
embedded discrete-time process, 359 
error probabilities, 389-390 

type I, 389 

type II, 389 
estimation of missing values 

in an ARIMA process, 287 

in an AR(p) process, 288 

in a state-space model, 286 
estimation of the white noise variance 

least squares, 161 

maximum likelihood, 160 

using Burg’s algorithm, 148 

using the Hannan-Rissanen algorithm, 157 

using the innovations algorithm, 155 

using the Yule-Walker equations, 142 
expectation, 373 
exponential distribution, 370 
exponential family models, 301-302 
exponential smoothing, 27-28, 322 


F 


filter (see linear filter) 
Fisher information matrix, 387 
forecasting, 63-77, 167-169 (see also 
prediction) 
forecasting ARIMA processes, 198-203 
forecast function, 200-203 
h-step predictor, 199 
mean square error of, 200 
forecast density, 293 
forward prediction errors, 147 
Fourier frequencies, 122 
Fourier indices, 13 
fractionally integrated ARMA process, 
361 
estimation of, 363 
spectral density of, 363 
Whittle likelihood approximation, 363 
fractionally integrated white noise, 362 
autocovariance of, 362 
variance of, 362 
frequency domain, 111 


G 


gamma distribution, 371 

gamma function, 371 

GARCH(p, q) process, 352-357 
ARMA model with GARCH noise, 356 
fitting GARCH models, 353-356 
Gaussian-driven, 354 
generalizations, 356 
regression with GARCH noise, 356 
t-driven, 355 
Gaussian likelihood 
in time series context, 387 
of a CAR(1) process, 359 
of a multivariate AR process, 246 
of an ARMA(p, q) process, 160 
with missing observations, 284-285, 
290 
of GARCH model, 354 
of regression with ARMA errors, 213 
Gaussian linear process, 344 
Gaussian time series, 380 
Gauss-Markov theorem, 385 
generalized distribution function, 115 
generalized least squares (GLS) 
estimation, 212, 386 
generalized inverse, 272, 312 
generalized state-space models 
Bayesian, 292 
filtering, 293 
forecast density, 293 
observation-driven, 299-311 
parameter-driven, 292-299 
prediction, 293 
Gibbs phenomenon, 131 
goals scored by England against 
Scotland, 306-311 
goodness of fit (see also tests of 
randomness) 
based on ACF, 21 


H 


Hannan-Rissanen algorithm, 156 

harmonic regression, 12-13 

Hessian matrix, 161, 214 

hidden process, 293 

Holt-Winters algorithm, 322-326 
seasonal, 326-328 

hypothesis testing, 389-391 
large-sample tests based on confidence 
regions, 390-391 

uniformly most powerful test, 390 


I 


independent random variables, 375 
identification techniques, 187-193 
for ARMA processes, 161, 169-174, 
189 
for AR(p) processes, 141 
for MA(q) processes, 152 
for seasonal ARIMA processes, 206 
iid noise, 8, 16 
sample ACF of, 61 
multivariate, 232 
innovations, 82, 273 
innovations algorithm, 73-75, 150-151 
fitted innovations MA(m) model, 151 
multivariate, 246 
input, 51 
intervention analysis, 340-343 
invertible 
ARMA process, 86 
multivariate ARMA process, 243 
Itô integral, 358 
ITSM, 31, 32, 43, 44, 81, 87, 95, 188, 
333, 337-339, 395-421 


J 


joint distributions of a time series, 7 
joint distribution of a random vector, 374 


K 


Kalman recursions 
filtering, 271, 276 
prediction, 271, 273 
h-step, 274 
smoothing, 271, 277 
Kullback-Leibler discrepancy, 171 
Kullback-Leibler index, 172 


L 


Lake Huron (LAKE.TSM), 10-11, 
21-23, 63, 149-150, 155, 157, 163, 
174, 193, 215-217, 291 
latent process, 293 
large-sample tests based on confidence 
regions, 390-391 
least squares estimation 
for ARMA processes, 161 
for regression model, 383-386 
for transfer function models, 333-335 
of trend, 10 
likelihood function, 386 (see also 
Gaussian likelihood) 
linear combination of sinusoids, 116 
linear difference equations, 201 
linear filter, 26, 42, 51 
input, 51 
low-pass, 26, 130 
moving-average, 31, 42 
output, 51 
simple moving-average, 129 
linear process, 51, 232 
ACVF of, 52 
Gaussian, 344 
multivariate, 232 
linear regression (see regression) 
local level model, 264 
local linear trend model, 266 
logistic equation, 345 
long memory, 318, 362 
long-memory model, 361-365 


M 


MA(1) process, 17 
ACF of, 17, 48 
estimation of missing values, 82 
moment estimation, 145 
noninvertible, 97 
order selection, 152 
PACF of, 110 
sample ACF of, 61 
spectral density of, 120 
state-space representation of, 312 
MA(q) (see moving average process) 
MA(∞), 51 
multivariate, 233 
martingale difference sequence, 343 
maximum likelihood estimation, 
158-161, 386-387 
ARMA processes, 160 
large-sample distribution of, 162 
confidence regions for, 161 
mean 
of a multivariate time series, 224 
estimation of, 234 
of a random variable, 373 
of a random vector, 376 
estimation of, 58 
sample, 57 
large-sample properties of, 58 
mean square convergence, 393-394 
properties of, 394 
measurement error, 98 
memory shortening, 318 
method of moments estimation, 96, 140 
minimum AICC AR model, 167 
mink trappings (APPH.TSM), 257 
missing values in ARMA processes 
estimation of, 286 
likelihood calculation with, 284 
mixture distribution, 372 
Monte Carlo EM algorithm (MCEM), 
298 
moving average (MA(q)) process, 50 
ACF of, 89 
sample, 94 
ACVF of, 89 
estimation 
confidence intervals, 152 
Hannan-Rissanen, 156 
innovations, 150-151 
maximum likelihood, 160, 162 
order selection, 151, 152 
partial autocorrelation of, 96 
unit roots in, 196-198 
multivariate AR process 
estimation, 247-249 
Burg’s algorithm, 248 
maximum likelihood, 246-247 
Whittle’s algorithm, 247 
forecasting, 250-254 
error covariance matrix of prediction, 251 
multivariate ARMA process, 241-244 
causal, 242 
covariance matrix function of, 244 
estimation 
maximum likelihood, 246-247 
invertible, 243 
prediction, 244-246 
error covariance matrix of prediction, 252 
multivariate innovations algorithm, 246 
multivariate normal distribution, 378 
bivariate, 379-380 
conditional distribution, 380 
conditional expectation, 380 
density function, 378 
definition, 378 
singular, 378 
standardized, 378 
multivariate time series, 223 
covariance matrices of, 229, 230 
mean vectors of, 229, 230 
second-order properties of, 229-234 
stationary, 230 
multivariate white noise, 232 
muskrat trappings (APPI.TSM), 257 


N 


negative binomial distribution, 372, 381 
NILE.TSM, 363-365 

NOISE.TSM, 334, 343 

nonlinear models, 343-357 
nonnegative definite matrix, 376 
nonnegative definite function, 47 
normal distribution, 370, 373 

normal equations, 384 

null hypothesis, 389 


O 


observation equation, 260 
of CARMA(p, q) model, 359 
ordinary least squares (OLS) estimators, 
211, 383-385 
one-step predictors, 71, 273 
order selection, 141, 161, 169-174 
AIC, 171 
AICC, 141, 161, 173, 191, 247, 407 
BIC, 173, 408 
consistent, 173 
efficient, 173 
FPE, 170-171 
orthogonal increment process, 117 
orthonormal set, 123 
overdifferencing, 196 
overdispersed, 306 
overshorts (OSHORTS.TSM), 96-99, 
167, 197, 215 
structural model for, 98 


P 


partial autocorrelation function (PACF), 
71, 94-96 
estimation of, 95 
of an AR(p) process, 95 
of an MA(1) process, 96 
sample, 95 
periodogram, 123-127 
approximate distribution of, 124 
point estimate, 388 
Poisson distribution, 371, 374 
model, 302 
polynomial fitting, 28 
population of USA (USPOP.TSM), 6, 9, 
30 
portmanteau test for residuals (see tests 
of randomness) 
posterior distribution, 294 
power function, 390 
power steady model, 305 
prediction of stationary processes (see 
also recursive prediction) 
AR(p) processes, 102 
ARIMA processes, 198-203 
ARMA processes, 100-108 
based on infinite past, 75-77 
best linear predictor, 46 
Gaussian processes, 108 
prediction bounds, 108 
large-sample approximations, 107 
MA(q) processes, 102 
multivariate AR processes, 250-254 
one-step predictors, 69 
mean squared error of, 105 
seasonal ARIMA processes, 208-210 
prediction operator, 67 
properties of, 68 
preliminary transformations, 187 
prewhitening, 237 
prior distribution, 294 
probability density function (pdf), 370 
probability generating function, 381 
probability mass function (pmf), 370 
purely nondeterministic, 78, 343 


Q 


q-dependent, 50 
q-correlated, 50 
qq plot, 38 

R 


R and S arrays, 180 
random noise component, 23 
random variable 
continuous, 370 
discrete, 370 
randomly varying trend and seasonality 
with noise, 267, 326 
random vector, 374-377 
covariance matrix of, 376 
joint distribution of, 374 
mean of, 376 
probability density of, 375 
random walk, 8, 17 
simple symmetric, 9 
with noise, 263, 274, 280 
rational spectral density (see spectral 
density function) 
realization of a time series, 7 
recursive prediction 
Durbin-Levinson algorithm, 69, 245 
Innovations algorithm, 71-75, 246 
Kalman prediction (see Kalman 
recursions) 
multivariate processes 
Durbin-Levinson algorithm, 245 
innovations algorithm, 246 
regression 
with ARMA errors, 210-219 
best linear unbiased estimator, 212 
Cochrane and Orcutt procedure, 212 
GLS estimation, 212 
OLS estimation, 211 
rejection region, 389 
RES.TSM, 343 
residuals, 35, 164 
check for normality, 38, 167 
graph of, 165 
rescaled, 164 
sample ACF of, 166 
tests of randomness for, 166 


S 


sales with leading indicator (LS2.TSM, 
SALES.TSM, LEAD.TSM), 228, 
238-241, 248-249, 335, 338 
sample 
autocorrelation function, 16-21 
MA(q), 94 
of residuals, 166 
autocovariance function, 19 
covariance matrix, 19 
mean, 19 
large-sample properties of, 58 
multivariate, 230 
partial autocorrelation, 95 
SARIMA (see seasonal ARIMA process) 
seasonal adjustment, 6 
seasonal ARIMA process, 203-210 
forecasting, 208-210 
mean squared error of, 209 
maximum likelihood estimation, 206 
seasonal component, 23, 301, 404 
estimation of 
method S1, 31 
elimination of 
method S2, 33 
seat-belt legislation (SBL.TSM, 
SBL2.TSM), 217-219, 341-343 
second-order properties, 7 
in frequency domain, 233 
short memory, 318, 362 
SIGNAL.TSM, 3 
signal detection, 3 
significance level, 390 
size of a test, 390 
smoothing 
by elimination of high-frequency 
components, 28 
with a moving average filter, 25 
exponential, 27-28, 323 
the periodogram (see spectral density 
estimation) 
using a simple moving average, 129 
spectral density estimation 
discrete spectral average, 125 
large-sample properties of, 126 
rational, 132 
spectral density function, 111-116 
characterization of, 113-114 
of an ARMA(1, 1), 134 
of an ARMA process, 132 
of an AR(1), 118-119 
of an AR(2), 133 
of an MA(1), 119-120 
of white noise, 118 
properties of, 112 
rational, 132 
spectral density matrix function, 233 
spectral distribution function, 116 
spectral representation 
of an autocovariance function, 115 
of a covariance matrix function, 233 
of a stationary multivariate time series, 
233 
of a stationary time series, 117 
Spencer’s 15-point moving average, 27, 
42 
state equation, 260 
of CARMA(p, q) model, 359 
stable, 263 
state-space model, 259-316 
estimation for, 277-283 
stable, 263 
stationary, 263 
with missing observations, 283-288 
state-space representation, 261 
causal AR(p), 267-268 
causal ARMA(p, q), 268 
ARIMA(p, d, q), 269-271 
stationarity 
multivariate, 230 
strict, 15, 52 
weak, 15 
steady-state solution, 275, 324 
stochastic differential equation 
first-order, 357 
pth-order, 359 
stochastic volatility, 349, 353, 355 
stock market indices (STOCK7.TSM), 
257, 367 
strictly stationary series, 15, 49 
properties of, 49 
strikes in the U.S.A. (STRIKES.TSM), 6, 
25, 28, 43, 110 
structural time series models, 98, 263 
level model, 263-265 
local linear trend model, 265, 323 
randomly varying trend and seasonality 
with noise, 267, 326 
estimation of, 277-286 
seasonal series with noise, 266 
sunspot numbers (SUNSPOTS.TSM), 81, 
99, 127, 135, 174, 344, 356 


T 


testing for the independence of two 
stationary time series, 237-241 
test for normality, 38, 167 
tests of randomness 
based on sample ACF, 36 
based on turning points, 36-37, 167 
difference-sign test, 37, 167 
Jarque-Bera normality test, 38, 167 
minimum AICC AR model, 167 
portmanteau tests, 36, 166, 352 
Ljung-Box, 36, 167, 352 
McLeod-Li, 36, 167, 352 
rank test, 37-38, 167 
third-order central moment, 347 
third-order cumulant function, 347, 366 
of linear process, 347, 360 
threshold model, 348 
AR(p), 349 
time domain, 111 
time-invariant linear filter (TLF), 
127-132 
causal, 127 
transfer function, 128 
time series, 1, 6 
continuous-time, 2 
discrete-time, 1 
Gaussian, 47 
time series model, 7 
time series of counts, 297-299 
transfer function, 129 
transfer function model, 331-339 
estimation of, 333-335 
prediction of, 337-339 
transformations, 23, 187-188 
variance-stabilizing, 187 
tree-ring widths (TRINGS.TSM), 367 
trend component, 9-12 
elimination of 
in absence of seasonality, 23-30 
by differencing, 29-30 
estimation of 
by elimination of high-frequency 
components, 28 
by exponential smoothing, 27-28 
by least squares, 10 
by polynomial fitting, 29 
by smoothing with a moving average, 
25, 31 


U 


uniform distribution, 370, 371 
discrete, 371 
uniformly most powerful (UMP) test, 390 
unit roots 
augmented Dickey-Fuller test, 195 
Dickey-Fuller test, 194 
in autoregression, 194-196 
in moving-averages, 196-198 
likelihood ratio test, 197 
locally best invariant unbiased (LBIU) 
test, 198 


V 


variance, 373 
volatility, 349, 353, 355 


W 


weight function, 125 
white noise, 16, 232, 405 
multivariate, 232 
spectral density of, 118 
Whittle approximation to likelihood, 363 
Wold decomposition, 77, 343 


Y 


Yule-Walker estimation (see also 
autoregressive process and 
multivariate AR process), 139 

for q > 0, 145 


Z 


zoom buttons, 398 